Visualizing Kubernetes Metrics And Alerts

Dashboards are useless! They are a waste of time. Get Netflix if you want to watch something. It’s cheaper than any other option.

I have repeated those words on many public occasions. I think that companies exaggerate the need for dashboards. They spend a lot of effort creating a bunch of graphs and put a lot of people in charge of staring at them. As if that’s going to help anyone. The main advantage of dashboards is that they are colorful and full of lines, boxes, and labels. Those properties are always an easy sell to decision makers like CTOs and heads of departments. When a software vendor comes to a meeting with decision makers who have the authority to write checks, he knows that there is no sale without “pretty colors”. What matters is not what that software does, but how it looks. That’s why every software company focuses on dashboards.

Think about it. What is a dashboard good for? Are we going to look at graphs until a bar reaches a red line indicating that a critical threshold is reached? If that’s the case, why not create an alert that will trigger under the same conditions and stop wasting time staring at screens and waiting until something happens? Instead, we can be doing something more useful (like watching Netflix).

Are our “panic criteria” more complex than what can be expressed through alerts? I do think that they are. However, that complexity cannot be reflected through pre-defined graphs. Sure, unexpected things happen, and we need to dig through data. However, the word “unexpected” defies what dashboards provide. They are all about the expected outcomes. Otherwise, how are we going to define a graph without knowing what to expect? “It can be anything” cannot be translated to a graph. Dashboards with graphs are our way to assume what might go wrong and put those assumptions on a screen or, more often than not, on a lot of screens. However, the unexpected can only be explored by querying metrics and going deeper and deeper until we find the cause of an issue. That’s investigative work that does not translate well to dashboards. We use Prometheus queries for that.
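To make “going deeper and deeper” a bit more concrete, here is a minimal sketch of how such a drill-down might look when expressed as a series of Prometheus instant queries. The `/api/v1/query` path is part of the Prometheus HTTP API. Everything else is an assumption made for illustration: the server address, the `http_requests_total` metric, and the `code`, `service`, and `path` labels are hypothetical and will differ in your cluster.

```python
from urllib.parse import urlencode

# Assumption: the address of a Prometheus server. Replace it with your own.
PROMETHEUS_URL = "http://prometheus.example.com"


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL of a Prometheus instant query for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


# Hypothetical drill-down: each query narrows the previous one.
# The metric and label names are illustrative, not taken from a real system.
steps = [
    # 1. Is the rate of server errors unusual across the whole cluster?
    'sum(rate(http_requests_total{code=~"5.."}[5m]))',
    # 2. Which service is responsible for them?
    'sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)',
    # 3. Which paths of that service are failing?
    'sum(rate(http_requests_total{code=~"5..", service="checkout"}[5m])) by (path)',
]

for expr in steps:
    print(instant_query_url(expr))
```

None of those steps could be drawn in advance, because the second and the third query depend on what the previous one returned. That is the part of the work a pre-defined graph cannot do for us.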

And yet, here I am dedicating a post to dashboards.

I do admit that dashboards are not (fully) useless. They are useful, sometimes. What I truly wanted to convey is that their usefulness is exaggerated and that we might need to construct and use dashboards differently than many are used to.

If I’m claiming that the value dashboards bring to the table is lower than we think, you might be asking yourself the same question from the beginning of this post. Why are we talking about dashboards? Well, I already changed my statement from “dashboards are useless” to “there is some value in dashboards”. They can serve as a registry for queries. Through dashboards, we do not need to memorize expressions that we would need to write in Prometheus. They might be a good starting point for our search for the cause of an issue before we jump into Prometheus for some deeper digging into metrics. But there is another reason I am including dashboards in the solution.

I love big displays. It’s very satisfying to enter a room with large screens showing stuff that seems to be important. There is usually a room where operators sit surrounded by monitors on all four walls. That’s usually an impressive sight. However, there is a problem with many such situations. A bunch of monitors displaying a lot of graphs might not amount to much more than a pretty sight. After the initial few days, nobody will stare at graphs. If someone still does, you can just as well fire that person knowing that he was faking his work.

Let me repeat it one more time.

Dashboards are not designed for us to stare at them, especially not when they are on big screens where everyone can see them.

So, if it’s a good idea to have big screens, but graphs are not a good candidate to decorate them, what should we do instead? The answer lies in semaphores. They are similar to alerts, and they should provide a clear indication of the status of the system. If everything on the screen is green, there is no reason for us to do anything. One of them turning red is a cue that we should do something to correct the problem. Therefore, it is imperative that we try to avoid false positives. If something turns red, and that does not require any action, we are likely to start ignoring it in the future. When that happens, we risk ignoring a real issue, thinking that it is just another false positive. Hence, every appearance of an alarm should be followed by an action. That can be either a fix that will correct the system or a change in the conditions that turned one of the semaphores red. In either case, we should not ignore it.
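The idea behind a semaphore can be sketched in a few lines. What follows is not how any particular dashboard tool implements it, only an illustration of the logic: a single number is reduced to a color by comparing it with thresholds. The metric names, the threshold values, and the intermediate orange state are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Semaphore:
    """A single-stat panel that reduces one metric to a color."""

    name: str
    warn_at: float  # hypothetical threshold at which the box turns orange
    crit_at: float  # hypothetical threshold at which the box turns red

    def color(self, value):
        """Return the color of the box for the given reading."""
        if value >= self.crit_at:
            return "red"
        if value >= self.warn_at:
            return "orange"
        return "green"


def needs_action(semaphores, readings):
    """Return the names of the semaphores that are not green."""
    return [s.name for s in semaphores if s.color(readings[s.name]) != "green"]


# Illustrative semaphores and readings, not taken from a real cluster.
semaphores = [
    Semaphore("cpu_usage_percent", warn_at=70, crit_at=90),
    Semaphore("error_rate_percent", warn_at=1, crit_at=5),
]
readings = {"cpu_usage_percent": 42.0, "error_rate_percent": 6.5}

print(needs_action(semaphores, readings))  # prints ['error_rate_percent']
```

If a box turns red and nobody needs to do anything about it, the thresholds (`warn_at` and `crit_at` in this sketch) are the conditions that should be changed. Otherwise, the semaphore becomes one more false positive we learn to ignore.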

The main problem with semaphores is that they are not as appealing to CTOs and other decision makers. They are not colorful, nor do they show a lot of boxes, lines, and numbers. People often confuse usefulness with how pleasing something is to look at. Nevertheless, we are not building something that should be sold to CTOs, but something that can be helpful in our day-to-day work.

Semaphores are much more useful than graphs as a way to see the status of the system, even though they do not look as colorful and eye-pleasing as graphs.

A dashboard should, in my experience, look like the screenshot that follows.

Does all that mean that all our dashboards should be green and red boxes with a single number inside them? I do believe that semaphores should be the “default” display. When they are green, there’s no need for anything else. If that’s not the case, we should extend the number of semaphores, instead of cluttering our monitors with random graphs. However, that raises the question. What should we do when some of the boxes turn red or even orange?

The extended view of the dashboard can look as follows.

The panels inside the Graphs row are a reflection of the panels (semaphores) in the Alerts row. Each graph shows more detailed data related to the single stat from the same location (but a different row). That way, we do not need to waste our time trying to figure out which graph corresponds to the “red box”. Instead, we can jump straight into the corresponding graph. If the semaphore in the second row on the right turns red, look at the graphs in the second row on the right. If multiple boxes turn red, we can take a quick look at related graphs and try to find the relation (if there is any). More often than not, we’ll have to switch from Grafana to Prometheus and dig deeper into metrics.
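The mirrored layout can be thought of as two grids of the same shape, where a position in one leads to the same position in the other. The sketch below only illustrates that pairing. The panel names are hypothetical, and it does not describe how Grafana stores its dashboards.

```python
# Hypothetical panel names. Both sections use the same (row, column) grid.
ALERTS = [
    ["nodes", "unschedulable_pods"],
    ["error_rate", "response_time"],
]
GRAPHS = [
    ["node_count_over_time", "pod_phases_over_time"],
    ["error_rate_by_service", "response_time_quantiles"],
]


def graphs_to_inspect(colors):
    """Return the graphs located at the same positions as the red semaphores.

    `colors` has the same shape as ALERTS and holds the color of each box.
    """
    return [
        GRAPHS[r][c]
        for r, row in enumerate(colors)
        for c, color in enumerate(row)
        if color == "red"
    ]


# The semaphore in the second row on the right is red.
colors = [
    ["green", "green"],
    ["green", "red"],
]
print(graphs_to_inspect(colors))  # prints ['response_time_quantiles']
```

There is no lookup table that maps alerts to graphs. The position is the mapping, which is why the two rows must be kept in sync whenever a semaphore is added or moved.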

Dashboards like the one in the screenshot should give us a quick head start towards the resolution of an issue. The semaphores on the top provide an alerting mechanism that should lead to the graphs below, which should give a quick indication of the possible causes of the problem. From there on, if the cause is not an obvious one, we can move to Prometheus and start debugging (if that’s the right word).

Dashboards with semaphores should be displayed on big screens around the office. They should provide an indication of a problem. Corresponding graphs (and other panels) provide a first look at the issue. Prometheus serves as the debugging tool we use to dig into metrics until we find the culprit.

The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes

The article you just read is an extract from The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes.

What do we do in Kubernetes after we master deployments and automate all the processes? We dive into monitoring, logging, auto-scaling, and other topics aimed at making our cluster resilient, self-sufficient, and self-adaptive.

The book is still in progress and it is currently available only from LeanPub.com. I’d love to hear your thoughts and comments.
