Automated rollback of Helm releases based on logs or metrics

 

by

Continuous delivery is becoming a standard, if you implement the right process you get a predictable deployment. When a change is made in the code, most of the time the build, test, deploy and monitor steps are followed. This is the base for anyone willing to apply automation to their release process.

If a failure is detected during the monitoring phase, then an operator has to verify and rollback the failing release to the previous known working state. This process is time consuming and not always truthful since it requires someone to keep an eye on the monitoring dashboard and react to it.

If the team is well structured and applies the devops way of working, then there will be someone on duty who receives an alert when something goes wrong. Alerts are triggered based on metrics, but still, after receiving the alert, the person on duty has to turn on their laptop (if not on-site), take a look at the graph, think for a moment, realise that the issue is coming from the last release and decide whether or not there is a need to roll back. If you step back and think about it, a significant amount of time was wasted during the process, just because a release failed and nobody kept an eye on it. What if you could automate this process?

In this blog post, I will provide some basic knowledge about Kubernetes deployment and how you can make it smarter using Helm and its plugin capabilities to control the rollback of a release, based on Prometheus metrics.

Some basics about Helm and Kubernetes deployment

If you are using Kubernetes, most of the time you store your configuration manifest into your application repository. Then during the deployment phase, run the kubectl apply -f ./manifest.yaml  command to apply your configuration to the cluster. Following this practice is usually good but requires creating one manifest per application. To simplify this process the team behind Deis (now part of Microsoft) created Helm.

Helm is a Kubernetes package manager which brings templating and controlled release to the next level. It is getting a lot of attraction lately since it simplifies the release process of applications. If you want to know more about Helm, I invite you to read the getting started guide: https://docs.helm.sh.

With a single command, you can release an application:

In the above example, I assume that the company-repo/common-chart  has already been created.

Then, if you want to rollback to a previous version:

Simple isn’t it?

Now that we can upgrade and rollback a release, we can jump on the next step: rolling back based on Prometheus metrics.

Monitoring a release, rollback on failure

Few days ago I released a Helm plugin called helm-monitor. This plugin which, as simple as it sounds, queries a Prometheus or ElasticSearch instance at a regular interval. When a positive result is returned, the application assimilate it as a failure and initiates a rollback of the release.

Below are the steps to follow to monitor a release using Helm via the helm-monitor plugin.

In this example, it is required to have direct access to the Prometheus instance (using kubectl port-forward can be handy if testing locally).

Install the plugin:

Upgrade your release with a new version of the application:

Since the upgrade has started, we can now monitor our release by querying Prometheus. In the following example, a rollback is initiated if the 5xx error rate is over 0 as measured over the last 5 minutes:

By default, helm-monitor runs a query every 10s for 5 minutes. If no results are found after 5 minutes, then the monitoring process is stopped.

Click on the screen below to see a rollback initiated by failure:

A complete example on how to use helm-monitor is available on Github.

So far, this plugin only supports Prometheus and ElasticSearch queries, but you could easily plug in another system. An ElasticSearch query could look like this:

To sum up

When it comes to detecting failure, even if accurate, humans are quite slow to recognize patterns. This is why moving toward an automated way to recognize common failure pattern and act on it during and after a release become necessary. With this approach, releasing applications using Helm together with the Helm Monitor plugin, you can save some time and focus on what matters: delivering high quality product.

I hope this information could help you to improve your delivery process, if you have any questions, feel free to ask.


Want to know more about monitoring in the cloud native era? Download our whitepaper below.

Download Whitepaper

5 Comments

  1. Hello and congrats for the plugin! Seems really nice and promising.
    I would like to use it in a CI/CD pipeline but I want to be able to set myself the name of the release it is going to auto-rollback. Is there an argument that can achieve that ?
    Thanks

    • Hi Kostas, you can specify the release name when calling helm monitor:

      Usage:

      monitor prometheus [flags] RELEASE PROMQL

      Example:

      $ helm monitor prometheus my-release 'rate(http_requests_total{code=~"^5.*$"}[5m]) > 0'

  2. Nice plugin! Very useful for ‘stateless’ deployments, maybe too aggressive when the deployment involves database migrations… in most such cases I wouldn’t feel comfortable with an automated rollback 😀

  3. Hi!
    Thank you for the plugin it’s really promising indeed.
    I have 2 questions:
    1. is it possible to hook it to Application Insights?
    2. is it possible to notify devs somehow that such rollback happened in CI/CD, or the only way is to fail the pipeline and thus get notified?
    Thanks in advance!

    • Hi Evergreen, Application Insights is currently not supported but you can track the progress via this issue opened few hours ago on the project.
      Notification should not be part of this plugin. As you mentioned the best way to notify about a failed release is to fail the pipeline. Helm monitor return exit code 1 after a rollback is initiated.

Leave a Reply

Your email address will not be published. Required fields are marked *