The DevOps 2.5 Toolkit: Monitoring, Logging, and Auto-Scaling Kubernetes is finally finished!!!
What do we do in Kubernetes after we master deployments and automate all the processes? We dive into monitoring, logging, auto-scaling, and other topics aimed at making our cluster resilient, self-sufficient, and self-adaptive.
Dashboards are useless! They are a waste or time. Get Netflix if you want to watch something. It’s cheaper than any other option.
I repeated those words on many public occasions. I think that companies exaggerate the need for dashboards. They spend a lot of effort creating a bunch of graphs and put a lot of people in charge of staring at them. As if that’s going to help anyone. The main advantage of dashboards is that they are colorful and full of lines, boxes, and labels. Those properties are always an easy sell to decision makers like CTOs and heads of departments. When a software vendor comes to a meeting with decision makers with authority to write checks, he knows that there is no sale without “pretty colors”. It does not matter what that software does, but how it looks like. That’s why every software company focuses on dashboards.
Think about it. What good is a dashboard for? Are we going to look at graphs until a bar reaches a red line indicating that a critical threshold is reached? If that’s the case, why not create an alert that will trigger under the same conditions and stop wasting time staring at screens and waiting until something happens. Instead, we can be doing something more useful (like staring Netflix).
Kubernetes HorizontalPodAutoscaler (HPA) and Cluster Autoscaler (CA) provide essential, yet very rudimentary mechanisms to scale our Pods and clusters. While they do scaling decently well, they do not solve our need to be alerted when there’s something wrong, nor do they provide enough information required to find the cause of an issue. We’ll need to expand our setup with additional tools that will allow us to store and query metrics as well as to receive notifications when there is an issue.
If we focus on tools that we can install and manage ourselves, there is very little doubt about what to use. If we look at the list of Cloud Native Computing Foundation (CNCF) projects, only two graduated so far (October 2018). Those are Kubernetes and Prometheus. Given that we are looking for a tool that will allow us to store and query metrics and that Prometheus fulfills that need, the choice is straightforward. That is not to say that there are no other similar tools worth considering. There are, but they are all service based. We might explore them later but, for now, we’re focused on those that we can run inside our cluster. So, we’ll add Prometheus to the mix and try to answer a simple question. What is Prometheus?
Kubernetes is probably the biggest project we know. It is vast, and yet many think that after a few weeks or months of reading and practice they know all there is to know about it. It’s much bigger than that, and it is growing faster than most of us can follow. How far did you get in Kubernetes adoption?
From my experience, there are four main phases in Kubernetes adoption.
In the first phase, we create a cluster and learn intricacies of Kube API and different types of resources (e.g., Pods, Ingress, Deployments, StatefulSets, and so on). Once we are comfortable with the way Kubernetes works, we start deploying and managing our applications. By the end of this phase, we can shout “look at me, I have things running in my production Kubernetes cluster, and nothing blew up!” I explained most of this phase in The DevOps 2.3 Toolkit: Kubernetes.
In the Forwarding Logs From All Containers Running Anywhere Inside A Docker Swarm Cluster article, we managed to add centralized logging to our cluster. Logs from any container running inside any of the nodes are shipped to a central location. They are stored in ElasticSearch and available through Kibana. However, the fact that we have easy access to all the logs does not mean that we have all the information we would need to debug a problem or prevent it from happening in the first place. We need to complement our logs with the rest of the information about the system. We need much more than what logs alone can provide.
I have so much chaos in my life, it’s become normal. You become used to it. You have just to relax, calm down, take a deep breath and try to see how you can make things work rather than complain about how they’re wrong.
— Tom Welling
Monitoring many services on a single server poses some difficulties. Monitoring many services on many servers requires a whole new way of thinking and a new set of tools. As you start embracing microservices, containers, and clusters, the number of deployed containers will begin increasing rapidly. The same holds true for servers that form the cluster. We cannot, anymore, log into a node and look at logs. There are too many logs to look at. On top of that, they are distributed among many servers. While yesterday we had two instances of a service deployed on a single server, tomorrow we might have eight instances deployed to six servers. The same holds true for monitoring. Old tools, like Nagios, are not designed to handle constant changes in running servers and services. We already used Consul that provides a different, not to say new, approach to managing near real-time monitoring and reaction when thresholds are reached. However, that is not enough. Real-time information is valuable to detect that something is wrong, but it does not give us information why the failure happened. We can know that a service is not responding, but we cannot know why.
With Docker there was not supposed to be a need to store logs in files. We should output information to stdout/stderr and the rest will be taken care by Docker itself. When we need to inspect logs all we are supposed to do is run
docker logs [CONTAINER_NAME].
With Docker and ever more popular usage of micro services, number of deployed containers is increasing rapidly. Monitoring logs for each container separately quickly becomes a nightmare. Monitoring few or even ten containers individually is not hard. When that number starts moving towards tens or hundreds, individual logging is unpractical at best. If we add distributed services the situation gets even worst. Not only that we have many containers but they are distributed across many servers.
The solution is to use some kind of centralized logging. Our favourite combination is ELK stack (ElasticSearch, LogStash and Kibana). However, centralized logging with Docker on large-scale was not a trivial thing to do (until version 1.6 was released). We had a couple of solutions but none of them seemed good enough.