Visualizing Kubernetes Metrics And Alerts

Dashboards are useless! They are a waste of time. Get Netflix if you want to watch something. It’s cheaper than any other option.

I repeated those words on many public occasions. I think that companies exaggerate the need for dashboards. They spend a lot of effort creating a bunch of graphs and put a lot of people in charge of staring at them, as if that’s going to help anyone. The main advantage of dashboards is that they are colorful and full of lines, boxes, and labels. Those properties are an easy sell to decision makers like CTOs and heads of departments. When a software vendor walks into a meeting with people who have the authority to write checks, they know there is no sale without “pretty colors”. What matters is not what the software does, but what it looks like. That’s why every software company focuses on dashboards.

Think about it. What is a dashboard good for? Are we going to stare at graphs until a bar reaches a red line indicating that a critical threshold has been reached? If that’s the case, why not create an alert that triggers under the same conditions and stop wasting time staring at screens, waiting for something to happen? Instead, we can be doing something more useful (like watching Netflix).

Are our “panic criteria” more complex than what can be expressed through alerts? They often are. However, that complexity cannot be captured by pre-defined graphs either. Sure, unexpected things happen, and we need to dig through data. But the word “unexpected” defies what dashboards provide. Dashboards are all about expected outcomes; otherwise, how would we define a graph without knowing what to expect? “It can be anything” cannot be translated into a graph. Dashboards with graphs are our way of assuming what might go wrong and putting those assumptions on a screen or, more often than not, on a lot of screens. The unexpected, however, can only be explored by querying metrics and going deeper and deeper until we find the cause of an issue. That is investigative work that does not translate well to dashboards. We use Prometheus queries for that.
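To make that investigative work a bit more concrete, here is a hypothetical drill-down, written as a simple YAML list only for readability. Each entry is a query we might paste into the Prometheus expression browser, each one narrowing the scope of the previous. The metric and label names (container_cpu_usage_seconds_total, namespace, pod) and the "production" namespace are assumptions; they depend on how your cluster exposes cAdvisor metrics.

# Hypothetical drill-down; adjust metric and label names to your cluster.
drill_down:
  - 'sum(rate(container_cpu_usage_seconds_total[5m]))'                                   # total CPU usage
  - 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'                    # which namespace?
  - 'sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)'  # which Pod?

None of those queries belongs on a dashboard; each exists only for as long as it takes to answer one question and move on to the next.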

And yet, here I am dedicating a post to dashboards.

I do admit that dashboards are not (fully) useless. They are useful, sometimes. What I truly wanted to convey is that their usefulness is exaggerated and that we might need to construct and use dashboards differently than many are used to.

If I’m claiming that the value dashboards bring to the table is lower than we think, you might be asking yourself the same question from the beginning of this post: why are we talking about dashboards? Well, I already changed my statement from “dashboards are useless” to “there is some value in dashboards”. They can serve as a registry of queries. Through dashboards, we do not need to memorize the expressions we would otherwise have to write in Prometheus. They might be a good starting point in our search for the cause of an issue before we jump into Prometheus for some deeper digging into metrics. But there is another reason I am including dashboards in the solution.

I love big displays. It’s very satisfying to enter a room with large screens showing stuff that seems to be important. There is usually a room where operators sit surrounded by monitors on all four walls. That’s usually an impressive sight. However, there is a problem with many such setups. A bunch of monitors displaying a lot of graphs might not amount to much more than a pretty sight. After the initial few days, nobody will stare at the graphs. If someone does, you might just as well fire that person, knowing that they were faking their work.

Let me repeat it one more time.

Dashboards are not designed for us to stare at them, especially not when they are on big screens where everyone can see them.

So, if it’s a good idea to have big screens, but graphs are not a good candidate to decorate them, what should we do instead? The answer lies in semaphores. They are similar to alerts, and they should provide a clear indication of the status of the system. If everything on the screen is green, there is no reason for us to do anything. One of them turning red is a cue that we should do something to correct the problem. Therefore, it is imperative that we try to avoid false positives. If something turns red and no action is required, we are likely to start ignoring it in the future. When that happens, we risk ignoring a real issue, thinking that it is just another false positive. Hence, every appearance of an alarm should be followed by an action. That can be either a fix that corrects the system or a change to the conditions that turned the semaphore red. In either case, we should not ignore it.
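To illustrate the alert-like nature of a semaphore, here is a minimal sketch of a Prometheus alerting rule that such a box could mirror. The metric name (http_requests_total), the 5% threshold, and the labels are assumptions for the sake of the example; replace them with whatever your system actually exposes.

groups:
  - name: semaphores
    rules:
      - alert: TooManyErrors
        # Hypothetical condition: more than 5% of requests failed over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of requests are failing"

The corresponding semaphore would typically use the same expression, with panel thresholds that turn it red at the point where the alert fires, so the big screen and the alerting system never disagree about the state of the system.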

The main problem with semaphores is that they are not as appealing to CTOs and other decision makers. They are not colorful, nor do they show a lot of boxes, lines, and numbers. People often confuse usefulness with how pleasing something is to look at. Nevertheless, we are not building something that should be sold to CTOs, but something that can be helpful in our day-to-day work.

Semaphores are much more useful than graphs as a way to see the status of the system, even though they do not look as colorful and eye-pleasing as graphs.

A dashboard should, in my experience, look like the screenshot that follows.

Does all that mean that all our dashboards should be green and red boxes with a single number inside them? I do believe that semaphores should be the “default” display. When they are green, there’s no need for anything else. If that’s not the case, we should extend the number of semaphores instead of cluttering our monitors with random graphs. However, that raises a question: what should we do when some of the boxes turn red or even orange?

The extended view of the dashboard can look as follows.

The panels inside the Graphs row are a reflection of the panels (semaphores) in the Alerts row. Each graph shows more detailed data related to the single stat in the same position (but a different row). That way, we do not need to waste time trying to figure out which graph corresponds to the “red box”. Instead, we can jump straight to the corresponding graph. If the semaphore in the second row on the right turns red, we look at the graph in the second row on the right. If multiple boxes turn red, we can take a quick look at the related graphs and try to find the relation between them (if there is any). More often than not, we’ll have to switch from Grafana to Prometheus and dig deeper into the metrics.
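As a concrete sketch of that mirroring, the two panels in one such pair might share a query along the following lines. This is not Grafana’s actual dashboard format (panels are defined in JSON inside Grafana); it only sketches the expressions and thresholds such a pair could use, and the metric name and threshold values are assumptions.

# Hypothetical semaphore/graph pair; only the queries and thresholds are sketched.
error_rate:
  semaphore:   # Alerts row: a single number with green/orange/red thresholds
    expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
    thresholds: [0.01, 0.05]   # assumed: orange above 1%, red above 5%
  graph:       # Graphs row, same position: the same ratio over time, split by service
    expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)'

Because both panels are driven by the same expression, a red box always has a graph directly below it that explains, at a glance, how we got there.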

Dashboards like the one in the screenshot should give us a quick head start towards the resolution of an issue. The semaphores at the top provide the alerting mechanism, and they lead us to the graphs below, which should give a quick indication of the possible causes of the problem. From there on, if the cause is not an obvious one, we can move to Prometheus and start debugging (if that’s the right word).

Dashboards with semaphores should be displayed on big screens around the office. They should provide an indication of a problem. Corresponding graphs (and other panels) provide a first look at the issue. Prometheus serves as the debugging tool we use to dig into metrics until we find the culprit.

The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes

The article you just read is an extract from The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes.

What do we do in Kubernetes after we master deployments and automate all the processes? We dive into monitoring, logging, auto-scaling, and other topics aimed at making our cluster resilient, self-sufficient, and self-adaptive.

Kubernetes is probably the biggest project we know. It is vast, and yet many think that after a few weeks or months of reading and practice they know all there is to know about it. It’s much bigger than that, and it is growing faster than most of us can follow. How far did you get in Kubernetes adoption?

From my experience, there are four main phases in Kubernetes adoption.

In the first phase, we create a cluster and learn the intricacies of the Kube API and the different types of resources (e.g., Pods, Ingress, Deployments, StatefulSets, and so on). Once we are comfortable with the way Kubernetes works, we start deploying and managing our applications. By the end of this phase, we can shout “look at me, I have things running in my production Kubernetes cluster, and nothing blew up!” I explained most of this phase in The DevOps 2.3 Toolkit: Kubernetes.

The second phase is often automation. Once we become comfortable with how Kubernetes works and we are running production loads, we can move to automation. We often adopt some form of continuous delivery (CD) or continuous deployment (CDP). We create Pods with the tools we need, we build our software and container images, we run tests, and we deploy to production. When we’re finished, most of our processes are automated, and we do not perform manual deployments to Kubernetes anymore. We can say, “things are working, and I’m not even touching my keyboard.” I did my best to provide some insights into CD and CDP with Kubernetes in The DevOps 2.4 Toolkit: Continuous Deployment To Kubernetes.

The third phase is in many cases related to monitoring, alerting, logging, and scaling. The fact that we can run (almost) anything in Kubernetes and that it will do its best to make it fault tolerant and highly available, does not mean that our applications and clusters are bulletproof. We need to monitor the cluster, and we need alerts that will notify us of potential issues. When we do discover that there is a problem, we need to be able to query metrics and logs of the whole system. We can fix an issue only once we know what the root cause is. In highly dynamic distributed systems like Kubernetes, that is not as easy as it looks.

Further on, we need to learn how to scale (and de-scale) everything. The number of Pods of an application should change over time to accommodate fluctuations in traffic and demand. Nodes should scale as well to fulfill the needs of our applications.
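As a hint of what such scaling rules look like in practice, here is a minimal HorizontalPodAutoscaler sketch that scales a Deployment based on CPU usage. The Deployment name, the replica bounds, and the 80% target are assumptions used only for illustration, and the API version depends on your Kubernetes release. Node scaling is handled separately, typically by the Cluster Autoscaler.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                     # hypothetical Deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2                # assumed lower bound
  maxReplicas: 10               # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # assumed target; tune it to the workload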

Kubernetes already has tools that provide metrics and visibility into logs. It allows us to create auto-scaling rules. Yet, we might discover that Kubernetes alone is not enough and that we need to extend our system with additional processes and tools. This phase is the subject of this book. By the time you finish reading it, you’ll be able to say that your clusters and applications are truly dynamic and resilient and that they require minimal manual involvement. We’ll try to make our system self-adaptive.

I mentioned the fourth phase. That, dear reader, is everything else. The last phase is mostly about keeping up with all the other goodies Kubernetes provides. It’s about following its roadmap and adapting our processes to get the benefits of each new release.

Buy it now from Amazon, LeanPub, or look for it through your favorite bookseller.
