Microservice systems have very different characteristics from previous architectures. The system as a whole may live for a long time, but its components – containers – do not. They live and die quickly.
This new situation is very challenging for many areas of software engineering. One of them is monitoring – how can an operator make sure that a system remains healthy if existing monitoring tools cannot deal with this new reality?
In this post I will explore the challenge, and present my ideas on how we can apply the knowledge of physics and biology to build more intelligent monitoring solutions.
The Challenge – Invisible Containers
Let’s have a look at Service X, which is composed of 3 short-lived containers. On the left you can see their real performance over time, and on the right we can see what has been recorded by a traditional monitoring system:
The typical sampling frequency of one sample every 1-10 minutes, which was suitable for monitoring physical servers and virtual machines, is clearly insufficient for monitoring containers. Some of them, like Container B in this example, are so short-lived that they go unnoticed. This means that if we used them to build a large microservice system, a large part of it would remain unmonitored, or even unmonitorable.
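The effect can be reproduced with a minimal sketch. The container lifetimes below are hypothetical (they are not taken from the figure), and the `Container` type and `observed` function are names introduced only for this illustration. A container is counted as visible only if it is alive at one or more sampling instants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str
    start: int  # second at which the container starts
    end: int    # second at which the container terminates


def observed(containers, interval, duration):
    """Return the names of containers alive at one or more sampling instants."""
    seen = set()
    for t in range(0, duration + 1, interval):
        for c in containers:
            if c.start <= t < c.end:
                seen.add(c.name)
    return seen


# Hypothetical lifetimes for the three containers of Service X.
containers = [
    Container("A", start=0, end=150),
    Container("B", start=70, end=110),  # lives for 40 s, between two samples
    Container("C", start=100, end=300),
]

print(sorted(observed(containers, interval=60, duration=300)))  # ['A', 'C']
print(sorted(observed(containers, interval=10, duration=300)))  # ['A', 'B', 'C']
```

With one sample per minute, Container B is never seen. It becomes visible only when the sampling interval is shorter than its lifetime.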
To fix this, we might think about increasing the frequency of sampling. Then the situation looks like this:
But what seems like a working solution – Container B is now visible – does not scale to production, particularly in environments where containers serve a single request or job and then disappear. Such ephemerality would require sampling at a frequency above 1 Hz (more than one sample per second), which would produce an enormous amount of data. I dare say that the monitoring system would become more complicated than the system it monitors.
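A back-of-the-envelope calculation shows the scale of the problem. The numbers below (1000 containers, 20 metrics each) are assumptions chosen only for illustration, and `samples_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(containers, metrics_per_container, interval_seconds):
    """Number of data points collected per day at a fixed sampling interval."""
    samples_per_metric = SECONDS_PER_DAY // interval_seconds
    return containers * metrics_per_container * samples_per_metric


# Assumed fleet: 1000 containers, 20 metrics per container.
per_minute = samples_per_day(1000, 20, interval_seconds=60)
per_second = samples_per_day(1000, 20, interval_seconds=1)

print(per_minute)  # 28800000
print(per_second)  # 1728000000
```

Moving from one sample per minute to one per second multiplies the volume by 60, and sampling faster than 1 Hz increases it further.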
The problem must be solved in a different way.
A true solution can be found if we take a close look at evolutionary trends and analyze where they point. That is why in this section I am going to look at the history of monitoring solutions with a focus on their intelligence, and more specifically on contextual awareness – a notion which will be explained along the way.
Class 0 (Context-Less Monitoring)
I believe that a monitoring solution becomes functionally complete if it is able to perform two operations: collect information and alert if there is an anomaly.
If we accept this, then ping – a well-known administration utility – turns out to be the simplest functionally complete implementation that satisfies the criteria: it collects information on the round-trip time of ICMP packets sent to a given host and if the time exceeds a predefined threshold value (the timeout), it reports this as an anomaly.
This approach tends to generate many false alarms, because an alert is sent every time the threshold is exceeded, which is not always an indication of a serious problem. Ping was not alone: in the past many tools relied solely on thresholds. I will refer to them as Class 0 solutions, as they began the history of monitoring solutions. Soon, though, more advanced solutions appeared.
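The decision logic of a Class 0 tool fits in a few lines. This is a sketch of the idea rather than the implementation of ping or any other tool. The function name and the round-trip times are made up for the example.

```python
def class0_alerts(samples, threshold):
    """Context-less monitoring: every sample above the threshold is an alert."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical round-trip times in milliseconds, hovering around a 100 ms limit.
rtt_ms = [20, 25, 120, 30, 110, 28, 105, 26]

print(class0_alerts(rtt_ms, threshold=100))  # [2, 4, 6]
```

Each sample is judged in isolation, so three brief excursions above the limit produce three separate alerts.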
Class 1 (Stateful Monitoring)
Nagios was one of the first representatives of the new class. It differed from the previous generation in the way it detects anomalies, and more specifically in the amount of contextual information used in the process.
First, let’s again have a look at how Class 0 solutions like ping operate. The decision logic is in fact quite primitive: they look at the state of the infrastructure as it is at the moment of observation, and then compare this information with a threshold value to decide if the state is healthy or not. In other words, raw measurements without any additional context are sufficient for the systems to react, and the reaction is very deterministic.
Nagios was different. From the beginning it offered a mechanism for detecting state flapping, which means that crossing the threshold value was no longer a sufficient reason to trigger an alert. Instead, Nagios compares the current sample with past values of the measurement to ensure that this is truly an anomaly that deserves attention. The memory of past samples provides historical context, which is a valuable source of information for all sorts of advanced anomaly detection algorithms. In more advanced implementations the memory could also be used for pattern detection. If such patterns occur (e.g. regular nightly backups that consume lots of resources), the decision logic could conclude that something which at first glance looks like an anomaly is in fact harmless, and that alerting operators brings no benefit.
The easiest way to explain the difference between Class 1 and Class 0 is to see what happens when measurements start to oscillate around the threshold value (so-called state flapping). A Class 0 system reacts every time the threshold is exceeded, which floods operators with alerts. A Class 1 system uses historical context to realize that the state is unstable, and only from time to time produces an aggregated alert.
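The same idea can be sketched in code. This is a simplified illustration and not Nagios's actual flap detection algorithm. The window size, the `max_changes` limit and the alert labels are assumptions made for the example.

```python
from collections import deque


def class1_alerts(samples, threshold, window=6, max_changes=3):
    """Stateful monitoring: use recent states to aggregate alerts while flapping."""
    history = deque(maxlen=window)  # historical context: the most recent states
    flapping = False
    alerts = []
    for i, value in enumerate(samples):
        unhealthy = value > threshold
        previous = history[-1] if history else False
        history.append(unhealthy)
        states = list(history)
        # Count how often the state changed inside the window.
        changes = sum(a != b for a, b in zip(states, states[1:]))
        if changes >= max_changes:
            if not flapping:
                alerts.append((i, "FLAPPING"))  # one aggregated alert
            flapping = True
        else:
            flapping = False
            if unhealthy and not previous:
                alerts.append((i, "PROBLEM"))  # a genuine state transition
    return alerts


# Hypothetical measurements oscillating around a threshold of 100.
rtt_ms = [20, 25, 120, 30, 110, 28, 105, 26]

print(class1_alerts(rtt_ms, threshold=100))  # [(2, 'PROBLEM'), (4, 'FLAPPING')]
```

On this input a plain threshold check would alert at samples 2, 4 and 6. The stateful version reports the first problem and then a single flapping alert, however long the oscillation continues.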
The awareness of historical context results in more intelligent decisions and therefore more accurate alerting. The advantage is so significant that nowadays almost all products on the market work in a similar way: both traditional Ops tools (Icinga, Shinken, Zabbix, Sensu) and container-friendly products (Prometheus, New Relic, Dynatrace, Sysdig Cloud, Scout, DataDog). I will call this type of monitoring Stateful Monitoring, or simply Class 1 monitoring.
Class 2 (Syntactic Monitoring)
Class 1 monitoring solutions are more intelligent than Class 0 tools, but both have an intrinsic limitation. Generally speaking, they do not understand the correlations between components. Such correlations exist if there are functional or structural dependencies between components, e.g. two containers competing for the I/O bandwidth of the physical host that runs them. As a consequence, if the relationship is experiencing any problems (e.g. starvation), it is the operator’s responsibility to correlate the facts and draw their own conclusions. Naturally, the involvement of a human operator dramatically increases the time of recovery.
This is not the case for the next class of monitoring solutions, which have just started appearing on the market. They not only understand the historical context like Class 1, but they also learn the syntactic (i.e. structural) context and actively use the knowledge. For instance, if related components degrade one after another (“domino effect”), instead of a series of alerts, they trigger only one.
The difference in the behavior of Class 2 and Class 1 systems can be observed when components have dependencies – for instance if a system consists of a database and many frontends that depend on it. A failure of the database immediately causes a Class 1 monitoring system to flood operators with alerts from all affected frontends. A Class 2 system instead realizes that the frontends depend on the same database, so the failure of the database is most likely the root cause of the problem, and it generates only one alert.
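A minimal sketch of this behaviour, assuming the dependency graph is already known and acyclic. The component names and the `root_causes` helper are hypothetical. A failing component is reported only if none of its direct dependencies is failing as well.

```python
def root_causes(failing, depends_on):
    """Return the failing components that have no failing direct dependency."""
    return {
        component
        for component in failing
        if not (depends_on.get(component, set()) & failing)
    }


# Hypothetical system: three frontends that all depend on one database.
depends_on = {
    "frontend-1": {"db"},
    "frontend-2": {"db"},
    "frontend-3": {"db"},
    "db": set(),
}

failing = {"db", "frontend-1", "frontend-2", "frontend-3"}
print(root_causes(failing, depends_on))  # {'db'}
```

Four failing components collapse into one alert. If only a frontend fails while the database is healthy, that frontend is reported as its own root cause.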
The ability to correlate facts is a great step toward automated root cause analysis. This means that the operator of a Class 2 monitoring system is much less involved in both the workings of the systems and decision making, as messages reach the operator much less frequently. And when they do, they carry more context. Effectively, a single operator with a Class 2 tool can do the work of a few with older tools.
Monitoring And Physics
At the time of writing (December 2015), I am aware of only one Class 2 solution that is production-ready. It is Ruxit, which appeared on the market quite recently (2014). As expected, it provides much deeper insight into the state of infrastructure. Sometimes, before it triggers a single alert, its distributed AI algorithms analyse millions of samples to make sure that this is a true anomaly. Yet, despite its sophistication, it still does not solve the main problem – how to monitor an infrastructure that changes rapidly.
For this, something more is needed, and the solution can be found if we look again at the classes, but this time from the perspective of physics. This is because all monitoring solutions make certain implicit assumptions about the reality they observe, and for each class the assumptions are slightly different. So in this section I will look again at the classes and physical models they implicitly use.
As explained, Class 0 solutions look only at the current state of observed objects. If we would like to find a physical analogy to containers observed in that way, the best would be atoms, as we imagine them to be solids that don’t change. This is obviously not true. However, the information collected at the moment of observation is sufficient to make certain decisions, for instance to classify the objects as acceptable or not, based on a threshold value. For atoms the measurement could relate to their size, mass, or energy; for containers it is their performance.
The situation becomes more interesting if we look at the reality as it is perceived by Class 1 solutions. It is definitely more complex, as they understand that the objects are not solid – their properties fluctuate. By recording how they fluctuate (historical context), they build up knowledge, which allows them to improve their decision process. Yet, it is interesting to notice that the fluctuations are not limited only to the observed object. They have yet another meaning.
I will explain this with an example. Imagine that we have a database and multiple frontends that depend on it. What can be observed is that a moment after the database experiences performance problems, the performance of the frontends drops as well, and finally, a few moments later, users start to experience a decrease in performance too.
So the fluctuation is not just a property of the monitored object. It is a wave that propagates through space and over time changes the performance of the objects it meets along its way. So if we would like to find a physical analogy for the reality as seen by Stateful Monitoring, it is a space of radiating objects. This sounds abstract but it takes us closer to finding a solution to the monitoring of ephemeral containers.
Shifting focus from solids to waves improved the decision process of Class 1 solutions, but the real use of this knowledge is made by Syntactic Monitoring. It collects information on how the performance of objects fluctuates, and in addition it knows the structure of systems. Translated into the language of physics, it effectively knows how the waves propagate and where they meet.
Understanding the physical analogy provides a new perspective. The effective performance of every container can be seen as the result of its own performance influenced by the performance of its dependencies. So if in the previous example a frontend depended on two databases, its effective performance would probably be correlated with theirs. The correlations could be strong or weak, depending on how strong the dependencies currently are, but the fact that they exist is already valuable information, which can be used in practice.
The algorithm is simple. If there is a drop in performance, it is enough to check if there _was_ (note the time shift) a corresponding decrease in any of its dependencies – if so, we know the culprit. If not, the drop must have been caused by the object itself. The illustration below shows how the decrease of performance of Container C was correlated with one of its dependencies – Container B.
The illustration is also a good representation of the reality as it is seen by Syntactic Monitoring – it is the space of objects, which are sensitive to radiation of objects close to them (the dependencies). As it knows the relative position of objects (the connections between containers), it can track wave interferences in order to analyze the correlation between wave components and their superposition. This simple rule is the basis for the root cause analysis performed by Class 2 solutions.
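The time-shifted check described above can be sketched as follows. The series, the threshold and the `max_lag` parameter are assumptions made for the illustration, and a "drop" is simplified to the first sample below a fixed threshold.

```python
def first_drop(series, threshold):
    """Index of the first sample below the threshold, or None if there is none."""
    for i, value in enumerate(series):
        if value < threshold:
            return i
    return None


def find_culprit(target, series_by_name, depends_on, threshold, max_lag):
    """Blame a dependency that dropped shortly before the target, else the target."""
    t = first_drop(series_by_name[target], threshold)
    if t is None:
        return None  # no drop, nothing to explain
    for dep in sorted(depends_on.get(target, ())):
        d = first_drop(series_by_name[dep], threshold)
        # Note the time shift: the dependency must have dropped earlier.
        if d is not None and 0 < t - d <= max_lag:
            return dep
    return target


# Hypothetical performance readings (higher is better), one value per sample.
series_by_name = {
    "A": [100, 100, 100, 100, 100, 100, 100],
    "B": [100, 100, 100, 40, 45, 90, 100],   # drops at sample 3
    "C": [100, 100, 100, 100, 100, 50, 60],  # drops at sample 5
}
depends_on = {"C": {"A", "B"}}

print(find_culprit("C", series_by_name, depends_on, threshold=60, max_lag=3))  # B
```

If none of the dependencies dropped within the lag window, the function returns the target itself, which corresponds to the case where the drop was caused by the object itself.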
Monitoring – The Next Generation
The principle used in Syntactic Monitoring is powerful, but it does not scale. In dense grids like microservice architectures, which are tightly interconnected and mutually dependent, the number of interferences is so large that tracking each pair of related components would be computationally challenging. This severely limits the ability to understand what happens in such a system.
Luckily, we already have many years of experience with power grids, which are also dense. So probably we can learn some lessons from there. Let us have a look at the sources of renewable energy. For instance, this is how wind turbines behave when clustered into a group of six (from Grid-Connected Wind Park with Combined Use of Battery and EDLC Energy Storages by Guohong Wu, Yutaka Yoshida and Tamotsu Minakawa):
and solar panels clustered into larger and larger solar plants (from Variability of Renewable Energy Sources by National Renewable Energy Laboratory):
The data provide some clues as to how software systems could be monitored despite the ephemerality of containers.
First, unstable sources of power deliver quite stable output if connected into a group. Their small fluctuations simply cancel each other out. This leads to the conclusion that the addition (superposition) of unstable waves clears the noise from individual wave components. Indeed, noise reduction is a well-known property of superposition, and it is commonly used by many imaging technologies, both human-made (e.g. arrays of telescopes) and biological (e.g. compound eyes), in order to provide high-quality images from low-quality sensors. So if we look at wind turbines and solar panels as sensors, the superposition of the signals they produce (i.e. output power) provides a high-quality image of the phenomena they observe – in this case the wind and the solar radiation over Southern California. By analogy, we might expect that if we superpose performance readings coming from low-quality components such as ephemeral containers, the result will provide high-quality information on the performance of the service they collectively provide.
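The cancelling effect can be checked with a small simulation. It rests on a strong assumption that the text does not guarantee: the sources fluctuate independently of each other. Under that assumption, the sum of N sources with mean μ and standard deviation σ has mean Nμ and standard deviation σ√N, so its relative fluctuation shrinks as 1/√N. The function name and the uniform noise model below are chosen only for illustration.

```python
import random
import statistics


def relative_variability(group_size, steps=2000, seed=42):
    """Standard deviation of the group's total output divided by its mean."""
    rng = random.Random(seed)
    totals = [
        sum(rng.uniform(0.0, 1.0) for _ in range(group_size))
        for _ in range(steps)
    ]
    return statistics.pstdev(totals) / statistics.mean(totals)


for size in (1, 10, 100):
    print(size, round(relative_variability(size), 3))
```

The printed ratio falls by roughly a factor of three for every tenfold increase in group size. Containers that serve the same load are rarely fully independent, so in practice the reduction would be smaller than in this idealised model.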
The next observation is that the larger the group of components is, the slower its collective output changes. And as the chance of a rapid change is small, we don’t need to check the state very often to understand it sufficiently well. In biology the phenomenon is also well known. As the body size of an animal grows, its manoeuvrability decreases (due to inertia), so there is no need for predators to observe large prey with high temporal resolution:
“[…] owing to the laws of physics, larger animals physically respond less quickly to a stimulus. Hence we expect selection against costly investment in sensory systems with unnecessarily high temporal resolution in large animals, as information on such timescales can no longer be utilized effectively”.  (Healy et al., 2013)
In our context the conclusion is that if the group of containers is large enough, checking its state with low frequency should be sufficient to understand the state of the service it provides. And as low-frequency monitoring is cheaper, this is exactly what we are looking for to solve our original problem.
Taking all this into account, my final conclusion is that the solution to the problem of ephemeral containers is a new monitoring system that shifts focus from individual containers to the collective behaviour of a group. This is similar to physical reality: where atoms change too fast for us to observe, we can shift focus to groups of them – particles. If they still change too fast, we can shift focus to higher levels of abstraction like cells. The process can continue until we find a level where movements happen at a speed comparable to the speed of the observer’s reactions. Once that level is reached, even low-frequency sampling (low temporal resolution) is sufficient to follow interactions between such compounds.
So if I again look for a physical analogy to the reality as it is seen by this new type of monitoring, it would be the realm of bigger and bigger compounds that interfere with each other. This means that once we learn to calculate the collective behavior (superposition) of a group, we can apply to groups the same techniques that Syntactic Monitoring tools use today at the level of individual containers, in order to understand interactions in the observed system.
Class 3 (Semantic Monitoring)
The new monitoring system clearly needs to be more intelligent than previous generations, as it needs to calculate the collective impact of containers on their environment. This means that on top of historical and syntactic context, it needs to be aware of semantic context – the functional role (meaning) of every container – in order to group them in a sensible way. At this new level of monitoring (Semantic Monitoring) the collective aspects, like the consistency of information stored and the resilience of structure, are more important than the health of individual containers.
Class 3 solutions understand that a given group of containers works collectively to achieve a specific goal. That’s why if a distributed database is unstable (e.g. Elasticsearch is in yellow state – not enough replicas), Class 3 immediately alerts that business continuity is endangered, whereas Class 2 believes that everything works fine as long as all endpoints respond correctly. And the opposite: if a container crashes, but Class 3 knows that it will be immediately replaced, it does not trigger any alert, whereas Class 2 does, as it does not know that the component is part of a larger whole.
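Both cases can be expressed as a rule evaluated on the group rather than on its members. The sketch below is only an illustration. The `ServiceGroup` fields, the group names and the `min_replicas` safety limit are assumptions, and no real Elasticsearch or orchestrator API is used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceGroup:
    name: str
    healthy_replicas: int
    min_replicas: int  # assumed: the fewest replicas the service needs to be safe


def class3_alert(group: ServiceGroup) -> Optional[str]:
    """Alert on the state of the group, not on the fate of single containers."""
    if group.healthy_replicas < group.min_replicas:
        return (
            f"{group.name}: {group.healthy_replicas} healthy replicas, "
            f"at least {group.min_replicas} required"
        )
    return None


# All endpoints respond, but the data store has lost its redundancy.
search = ServiceGroup("search-cluster", healthy_replicas=2, min_replicas=3)
# One of ten stateless workers crashed and is about to be replaced.
workers = ServiceGroup("worker-pool", healthy_replicas=9, min_replicas=5)

print(class3_alert(search))   # an alert message
print(class3_alert(workers))  # None
```

The crash in the worker pool stays silent because the group as a whole is still well above its limit, while the search cluster raises an alert even though no request has failed yet.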
Given the original example of 3 containers that provide Service X, this is how Semantic Monitoring would work:
Every time the monitoring system is about to take a sample of the performance of the service, it checks which containers currently provide it. Then it measures their performance and sums the results (the assumption here is that the service scales linearly).
This way, Class 3 solutions would collect information on the overall performance of the service, and later use it to calculate its collective influence on another group of containers that provide another service.
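The sampling step can be sketched as follows. The registry lookup and the measurement function are passed in as parameters because the text does not say where this information comes from. The dictionaries below are hypothetical stand-ins for a service registry and a metrics source.

```python
def sample_service(service, members_of, measure):
    """Sum the performance of the containers that provide the service right now."""
    return sum(measure(container) for container in members_of(service))


# Hypothetical service registry and per-container throughput (requests/s).
registry = {"service-x": ["a", "b"]}
throughput = {"a": 120.0, "b": 80.0, "c": 100.0}


def members_of(service):
    return registry[service]


def measure(container):
    return throughput[container]


first = sample_service("service-x", members_of, measure)

# Between two samples container b disappears and container c takes over.
registry["service-x"] = ["a", "c"]
second = sample_service("service-x", members_of, measure)

print(first, second)  # 200.0 220.0
```

Membership is resolved at the moment of sampling, so containers may come and go between samples without the series for the service being interrupted. Summing is valid only under the linear scaling assumption stated above.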
The DevOps movement appeared as a reaction to the problem of miscommunication between developers and operators. It had been noticed that when a software system passes from development to production, a large amount of information, related mainly to its structure and the role of its components, is lost.
This research shows that the loss of this contextual information has yet another unexpected consequence: it blocks monitoring solutions from getting more intelligent. While the structure (i.e. syntax) of systems can be partially reverse-engineered through network discovery, recovering the role of a system’s components (i.e. semantics) in an automated way is nearly impossible.
Fortunately, as an industry we have started to face the problem. On one side we have vendors like Ruxit, who build complex processing logic around graph models of systems. On the other side, developers are learning to model systems as graphs (by using tools like HashiCorp’s Terraform). I believe that the old picture below is a good metaphor of the situation we have now, and hopefully soon the initiatives from both worlds will meet and agree upon a common domain-specific language (DSL) for expressing structure (syntax) and the meaning (semantics) of systems.
The Harbour Bridge in Sydney – the collection of National Library of Australia
Once this happens, systems of any complexity will progress from development to production without any loss of information and with no need to involve IT operations in the process (more on this trend in Dev and Ops in the Time of Clouds). That effectively means that there will be no need for DevOps anymore.
Eventually this will be the beginning of a new generation of cloud computing (more on its features in From Microservices To Artificial Intelligence Operating System).
Kevin Healy, Luke McNally, Graeme D. Ruxton, Natalie Cooper, Andrew L. Jackson – Metabolic rate and body size are linked with perception of temporal information (2013)
Under a Creative Commons license
Graphics designed by remember to play