Monitoring Microservices with Prometheus

by Philip Winder

We seem to say this a lot at Container Solutions, but the combination of microservice architectures and the practice of DevOps has smashed apart the assumptions made by traditional tools. Nowhere more so than in the monitoring sector, which we’ve talked about in the semantic monitoring and monitoring performance blog posts.

Traditional monitoring solutions like Nagios and New Relic (to give them credit, they do appear to be progressing) initially made two key assumptions: applications are monolithic and their users are operations staff. Both assumptions have been demolished, which left a gaping hole in the marketplace for microservice-cum-DevOps monitoring solutions. Since we’re also keen on open-source solutions, there are only a few options. Some notable solutions are Graphite, InfluxDB and the focus of this article, Prometheus.

Prometheus has been accepted by the Cloud Native Computing Foundation as a hosted, incubating project, which helps to justify the need of such a technology and verify that Prometheus is a promising solution to that problem.

The problem, when dressed down to its bare necessities – sorry, my daughter has just watched the Jungle Book for the tenth time – is that developers are assuming long-term responsibility for the code that they write. They require monitoring on a level that is completely different to traditional operational tasks. In order to ensure that their code is working correctly/optimally, they require monitoring to be an integral part of the code. In fact, I’m going to go as far as to suggest that monitoring is the most important part of the code, because only these metrics are capable of truly verifying that requirements have been met.

Hence, we, as developers, require a code-centric, open-source monitoring solution that is scalable and capable of accepting metrics, storing metrics, alerting, and providing some way to visualise and inspect the metrics for ongoing optimisation and analysis. For our most recent projects we’ve chosen Prometheus, and this post will show you how to integrate it into Java- and Go-based microservices.

Introducing Prometheus

Prometheus is a simple, effective open-source monitoring system. It is in full-time use in the projects being developed in collaboration with our good friends over at Weaveworks. It has become particularly successful because of its intuitive simplicity. It doesn’t try to do anything fancy. It provides a data store, data scrapers, an alerting mechanism and a very simple user interface.

There is a significant amount of online discussion about Prometheus’ data store. To cut a (very) long story short, the data is stored as key-value pairs in memory-cached files. Therefore, if you have a significant number of metrics/services, you need to be careful about storage requirements. A single instance can manage a large number of services/metrics; suggestions seem to be around one million scrapes per ten seconds, i.e. 100,000/s. After this point the initial recommendation is to split instances per function (frontend/backend/etc.). After that, sharding is possible, but must be configured manually. This is potentially one area that is harmed by the simplicity, although it is unlikely to be a serious issue for all but the most demanding users.

The data scrapers, as the name suggests, use a pull model; that is, Prometheus has to pull metrics from individual machines and services. This does mean that you have to plan to provide metrics endpoints in your custom services (there are automatic scrapers for a range of common out-of-the-box technologies), but this isn’t as bad as it sounds (I’ll show you how to add metrics to Java- and Go-based services later). Some might argue that this is rather antiquated, but I like it, because it is obvious. And when things are obvious, you don’t need to spend man-hours (man-weeks for some monitoring systems) training users and developers. When things are obvious, it’s much harder to make mistakes.
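To make the pull model concrete, here is a minimal prometheus.yml sketch that scrapes a hypothetical service every ten seconds (the job name and target address are illustrative):

    scrape_configs:
      - job_name: 'my-service'
        scrape_interval: 10s
        metrics_path: /metrics   # the default path
        static_configs:
          - targets: ['my-service:8080']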

The only thing to be careful about is the naming of your metrics. It’s really easy to let the number of metrics get out of hand, and it helps to have some central governance over their naming. For more information see the Prometheus documentation.

Prometheus’ approach to alerting is also very simple. It is possible to send alerts to a number of services: email, generic webhooks, PagerDuty, HipChat, Slack, Pushover and Flowdock. All of the setup is performed via configuration, and the alerting rules are scriptable.
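For example, an alerting rule in Prometheus 1.x’s rule syntax that pages when a scrape target disappears might look like this (a sketch; the labels and annotations are illustrative):

    ALERT InstanceDown
      IF up == 0
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Instance {{ $labels.instance }} is down",
        description = "{{ $labels.instance }} has been unreachable for more than 5 minutes."
      }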

Finally, whilst the UI isn’t going to win any design awards, it is functional and great for debugging during development. For full-time dashboards, consider one of the many dashboarding tools that support Prometheus, such as PromDash or Grafana. The assertive decision not to include a fully-fledged dashboard is another point in Prometheus’ favour: it means Prometheus can concentrate on what it does best, monitoring.

[Image: The Prometheus user interface]

Adding Prometheus Metrics to Go Applications

More often than not when developing microservices in Go, we make use of the Go Kit toolkit. Go Kit makes it incredibly easy to add Prometheus monitoring, which is supported out-of-the-box.

Prometheus monitoring comes from the Metrics package, in the form of a decorator pattern, which is the same pattern used for Go Kit’s logging. The middleware function decorates a service by implementing the same interface and intercepting the call from an endpoint. For example, if you imagine you provided a Count method on your service, then the interface would look like:
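Something along these lines, modelled on Go Kit’s stringsvc example (a sketch; the Service name and Count signature are illustrative):

    // Service is the business interface that the middleware will wrap.
    type Service interface {
        Count(s string) int
    }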

Then to make the metrics middleware, we would pass in Go Kit’s implementation of Prometheus’ counter and histogram with:
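A sketch of that middleware, assuming the current Go Kit metrics interfaces (the field names are illustrative):

    import "github.com/go-kit/kit/metrics"

    // instrumentingMiddleware decorates Service, recording a request count
    // and a latency observation for every call before delegating to next.
    type instrumentingMiddleware struct {
        requestCount   metrics.Counter
        requestLatency metrics.Histogram
        next           Service
    }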

Finally, we would decorate the service with a function that provides a count of the number of hits and an estimate of the latency of the service:
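A sketch of the decorated method (it assumes the "time" package is imported; the "method" label is illustrative):

    // Count implements Service. The deferred function runs after the wrapped
    // call returns, so the observed duration covers the real work.
    func (mw instrumentingMiddleware) Count(s string) (n int) {
        defer func(begin time.Time) {
            mw.requestCount.With("method", "count").Add(1)
            mw.requestLatency.With("method", "count").Observe(time.Since(begin).Seconds())
        }(time.Now())
        return mw.next.Count(s)
    }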

In the method that wires the service and middleware together, you would then simply create new instances of the metrics objects and add a REST endpoint called /metrics that returns the handler stdprometheus.Handler(). That’s it!
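A sketch of that wiring, using Go Kit’s Prometheus adapter and the official client’s promhttp handler (the modern replacement for the deprecated stdprometheus.Handler(); the namespace, subsystem and basicService type are illustrative):

    import (
        "log"
        "net/http"

        kitprometheus "github.com/go-kit/kit/metrics/prometheus"
        stdprometheus "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        fieldKeys := []string{"method"}
        requestCount := kitprometheus.NewCounterFrom(stdprometheus.CounterOpts{
            Namespace: "my_group",
            Subsystem: "my_service",
            Name:      "request_count",
            Help:      "Number of requests received.",
        }, fieldKeys)
        requestLatency := kitprometheus.NewSummaryFrom(stdprometheus.SummaryOpts{
            Namespace: "my_group",
            Subsystem: "my_service",
            Name:      "request_latency_seconds",
            Help:      "Duration of requests in seconds.",
        }, fieldKeys)

        var svc Service = basicService{} // basicService is your concrete implementation (illustrative)
        svc = instrumentingMiddleware{requestCount, requestLatency, svc}
        _ = svc // hand svc to your transport layer (endpoints) here

        // Prometheus scrapes this endpoint.
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }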

For complete examples, see the Go Kit documentation or any of the Go-based services in the Microservices-Demo repositories.

Adding Prometheus Metrics to Java Applications

Prometheus provides libraries for Java applications which can be used directly, but we often use Spring and Spring Boot, and it would be a shame not to take advantage of Spring Boot Actuator. Hence, with a huge chunk of help from Johan Zietsman’s blog post, it is reasonably easy to get a lot of monitoring for very little code.

First you need to include the Actuator and Prometheus dependencies in your Maven/Gradle file.
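For example, with Gradle (the artefacts are Spring Boot’s Actuator starter plus the Prometheus simpleclient; the versions shown are illustrative):

    dependencies {
        compile 'org.springframework.boot:spring-boot-starter-actuator'
        compile 'io.prometheus:simpleclient:0.0.21'
        compile 'io.prometheus:simpleclient_common:0.0.21'
    }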

Next, we need to provide code to convert Actuator’s metrics into a format that Prometheus can understand. Counters and gauges are stored in maps, and the name of the Actuator metric is sanitised into one that is Prometheus-friendly.
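A sketch of such a writer, assuming Spring Boot 1.x’s MetricWriter interface and the Prometheus simpleclient (the class name and the sanitising rule are illustrative):

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.Counter;
    import io.prometheus.client.Gauge;
    import org.springframework.boot.actuate.metrics.Metric;
    import org.springframework.boot.actuate.metrics.writer.Delta;
    import org.springframework.boot.actuate.metrics.writer.MetricWriter;

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Adapts Actuator's metrics into Prometheus collectors.
    public class PrometheusMetricWriter implements MetricWriter {

        private final ConcurrentMap<String, Counter> counters = new ConcurrentHashMap<>();
        private final ConcurrentMap<String, Gauge> gauges = new ConcurrentHashMap<>();
        private final CollectorRegistry registry;

        public PrometheusMetricWriter(CollectorRegistry registry) {
            this.registry = registry;
        }

        @Override
        public void increment(Delta<?> delta) {
            counter(delta.getName()).inc(delta.getValue().doubleValue());
        }

        @Override
        public void reset(String metricName) {
            counter(metricName).clear();
        }

        @Override
        public void set(Metric<?> value) {
            gauge(value.getName()).set(value.getValue().doubleValue());
        }

        // Register each collector once, keyed by its sanitised name.
        private Counter counter(String name) {
            String key = sanitise(name);
            return counters.computeIfAbsent(key,
                    k -> Counter.build().name(k).help(k).register(registry));
        }

        private Gauge gauge(String name) {
            String key = sanitise(name);
            return gauges.computeIfAbsent(key,
                    k -> Gauge.build().name(k).help(k).register(registry));
        }

        // Dots and dashes in Actuator names are not legal in Prometheus names.
        private String sanitise(String name) {
            return name.replaceAll("[^a-zA-Z0-9_]", "_");
        }
    }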

Then we need to provide endpoints for the metrics. This first file writes the map into the admittedly crazy Prometheus text format.
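With everything registered in a CollectorRegistry, the simpleclient_common TextFormat class can do the rendering; a sketch (the class name is illustrative):

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.exporter.common.TextFormat;

    import java.io.IOException;
    import java.io.StringWriter;
    import java.io.Writer;

    // Renders every registered collector in the Prometheus text
    // exposition format (version 0.0.4).
    public final class PrometheusTextRenderer {

        private PrometheusTextRenderer() {
        }

        public static String render(CollectorRegistry registry) throws IOException {
            Writer writer = new StringWriter();
            TextFormat.write004(writer, registry.metricFamilySamples());
            return writer.toString();
        }
    }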

And then a Spring MVC endpoint is created to host the data.
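A sketch of that endpoint, reusing the renderer above (the /prometheus path matches the configuration discussed next):

    import io.prometheus.client.CollectorRegistry;
    import io.prometheus.client.exporter.common.TextFormat;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RestController;

    import java.io.IOException;

    // Hosts the rendered metrics for the Prometheus scraper.
    @RestController
    public class PrometheusEndpoint {

        @Autowired
        private CollectorRegistry registry;

        @RequestMapping(value = "/prometheus", method = RequestMethod.GET,
                produces = TextFormat.CONTENT_TYPE_004)
        public String metrics() throws IOException {
            return PrometheusTextRenderer.render(registry);
        }
    }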

And finally, some configuration to wire it all together. You will notice that the endpoint is created on /prometheus and not the standard /metrics. This is because Actuator is still running and exposes itself on /metrics. You aren’t able to override this functionality (easily), so it is simpler to use a different endpoint and change the Prometheus configuration.
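A sketch of the wiring, assuming Spring Boot 1.x’s @ExportMetricWriter support:

    import io.prometheus.client.CollectorRegistry;
    import org.springframework.boot.actuate.autoconfigure.ExportMetricWriter;
    import org.springframework.boot.actuate.metrics.writer.MetricWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class PrometheusConfiguration {

        // A dedicated registry keeps these metrics separate from any other
        // Prometheus collectors in the application.
        @Bean
        public CollectorRegistry collectorRegistry() {
            return new CollectorRegistry();
        }

        // @ExportMetricWriter tells Actuator to stream its metrics
        // through this writer.
        @Bean
        @ExportMetricWriter
        public MetricWriter prometheusMetricWriter(CollectorRegistry registry) {
            return new PrometheusMetricWriter(registry);
        }
    }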

That’s it! Admittedly, this isn’t perfect. With this configuration it isn’t possible to see individual service endpoints (because you haven’t coded them), but it does provide a lot of functionality for very little code. In the future we intend to make this automatic by providing an artefact that you can @Enable.

To see an example of this in action, visit any of the Java-based services in the Microservices-Demo repositories, for example the carts repository.


Philip Winder

Phil Winder is a multi-disciplinary consultant architect who specialises in the research and development of cutting-edge technology. His expertise lies at the boundary between software development and machine learning. Describing himself simply as an “Engineer”, he has 10 years’ experience in a wide range of engineering disciplines (7 years software and machine learning, 3 years production electronics). Phil has Ph.D. and Master’s degrees from the University of Hull, U.K., both in Electronics, with a focus on embedded signal processing.


3 Comments

  1. Your code works nicely right up to the point that you start trying to use Eureka metrics. Sadly the counter then breaks, as Actuator Delta values can be negative.
    Instead (according to the docs at least) they want you to use a Gauge.

    If you do that then the PrometheusMetricWriter simplifies to something more like this:

    public class PrometheusMetricWriter implements MetricWriter {

        private final ConcurrentMap<String, Gauge> gauges = new ConcurrentHashMap<>();
        private final CollectorRegistry collectorRegistry;

        @Autowired
        public PrometheusMetricWriter(CollectorRegistry collectorRegistry) {
            this.collectorRegistry = collectorRegistry;
        }

        @Override
        public void increment(Delta delta) {
            gauge(delta.getName()).inc(delta.getValue().doubleValue());
        }

        @Override
        public void reset(String metricName) {
            gauge(metricName).clear();
        }

        @Override
        public void set(Metric value) {
            gauge(value.getName()).set(value.getValue().doubleValue());
        }

        private Gauge gauge(String name) {
            String key = Gauge.sanitizeMetricName(name);
            return gauges.computeIfAbsent(key, k -> Gauge.build().name(k).help(k).register(collectorRegistry));
        }
    }

  2. Hi Philip,

    Thanks for sharing your article.

    I read your article with great interest because it mentions Prometheus. One of my customers in the Netherlands uses Prometheus.

    Before I go further: I’m not a biased consultant regarding open-source or commercial monitoring software. If the monitoring solution fulfills the needs of the customer, that’s perfectly fine.

    Could you elaborate on two quotes for me?

    Quote #1: “They require monitoring on a level that is completely different to traditional operational tasks. In order to ensure that their code is working correctly/optimally, they require monitoring to be an integral part of the code. ”

    The questions I have:

    What should be different? What are the reasons that monitoring should be part of the code? What are the performance penalties? It sounds like embedding debugging statements in your code, something you do before your code goes into production. Profiling/diagnostics agents are perfectly capable of tracking and tracing your code statements and relating this to end-user performance and stability problems.

    A remark regarding ensuring your code works optimally: embedding monitoring code doesn’t ensure that your code works optimally. Performance testing under the right conditions and performance monitoring in production provide insight into what your code is doing under load. Performance testing is a moment in time, not a guarantee that performance and stability issues will not occur.

    Another quote:
    “Traditional monitoring solutions like Nagios and New Relic (to give them credit, they do appear to be progressing) initially made two key assumptions: applications are monolithic and their users are operations staff.”

    I disagree with your quote, but I like to understand what you mean.

    What do you mean by traditional monitoring solutions? What is the definition? You positioned two monitoring perspectives: technical/system monitoring (Nagios, without APM functionality) and New Relic (cloud monitoring with APM functionality).

    Did you check your assumptions? Because certain commercial APM tooling treats microservices the way they should be monitored.

    In the current world of DevOps, which is moving towards NoOps, operations staff are no longer the only role responsible for keeping the lights on. Current APM tooling is focussed on the complete application life cycle, including CI/CD. These tools address the collaboration of different roles, like developers and ops.

    I would really like to hear from you.

    • Hi Stephen,
      You have some great questions and the answers deserve something longer than a quick reply. But let me at least try.

      I argue that from an engineering perspective, metrics that form part of a monitoring strategy are just as important as the code. Hence, they should form a core part of the code. They are not an afterthought. They should be designed so that engineering can ensure systems are running to the best of their ability and provide metrics that warn against failure.

      Regarding performance, the performance hit for Prometheus is so small it’s not worth considering. Remember that Prometheus is a pull model. So the metric is simply updating some in-memory variable. Prometheus will scrape your /metrics endpoint and this will take some rendering time, but this is minimal and can be optimised if required.

      Profilers/debuggers attached to your code definitely incur a performance hit, because of all the information they need to capture from your code. Although I don’t have any figures to hand.

      I mean “optimally” in the most general sense. I.e. it’s supposed to do what you intended it to do. Optimally means different things for different people.

      Again, I use the term “traditional tools” very generally. Generally, 5 or 10 years ago you used to build your monolith in a particular language, then find a monitoring tool that could hook into the language runtime (or equivalent). Then a team of people would be responsible for monitoring this software. I’m suggesting that, for the majority of companies in 2017, this framework isn’t scalable. Along with other inevitable anti-innovation practices, the delays would make a significant impact on the viability of the business. Indeed, the industry has accepted that this isn’t scalable and is moving towards greater empowerment through ownership. Which means engineers now need their own monitoring tools, not Operations’ monitoring tools.

      Also, I’m suggesting that this “monitoring as an afterthought” does not work for microservices or lower. Nagios et al. are trying to pull these new methods of development (microservices/serverless) into their catalogue, through plugins/extras. This might be acceptable, depending on a business’ requirements.

      I apologise for the brevity, I hope I’ve answered your questions.

      Thanks again for the great comments.
      Phil
