I have so much chaos in my life, it's become normal. You become used to it. You have just to relax, calm down, take a deep breath and try to see how you can make things work rather than complain about how they're wrong.
-- Tom Welling
Monitoring many services on a single server poses some difficulties. Monitoring many services on many servers requires a whole new way of thinking and a new set of tools. As you start embracing microservices, containers, and clusters, the number of deployed containers will begin increasing rapidly. The same holds true for the servers that form the cluster. We can no longer log into a node and look at logs. There are too many logs to look at. On top of that, they are distributed among many servers. While yesterday we had two instances of a service deployed on a single server, tomorrow we might have eight instances deployed to six servers. The same holds true for monitoring. Old tools, like Nagios, are not designed to handle constant changes in running servers and services. We already used Consul, which provides a different, not to say new, approach to near real-time monitoring and to reacting when thresholds are reached. However, that is not enough. Real-time information is valuable for detecting that something is wrong, but it does not tell us why the failure happened. We can know that a service is not responding, but we cannot know why.
We need historical information about our system. That information can be in the form of logs, hardware utilization, health checking, and many other things. The need to store historical data is not new, and solutions for it have been in use for a long time. However, the direction in which information travels has changed over time. While, in the past, most solutions were based on centralized data collectors, today, due to the very dynamic nature of services and servers, we tend to have decentralized data collectors.
What we need for cluster logging and monitoring is a combination of decentralized data collectors that send information to a centralized parsing service and data storage. There are plenty of products specially designed to fulfill this requirement, ranging from on-premise to cloud solutions, and everything in between. FluentD, Loggly, GrayLog, Splunk, and DataDog are only a few of the solutions we can employ. I chose to show you the concepts through the ELK stack (ElasticSearch, LogStash, and Kibana). The stack has the advantage of being free, well documented, efficient, and widely used. ElasticSearch established itself as one of the best databases for real-time search and analytics. It is distributed, scalable, highly available, and provides a sophisticated API. LogStash allows us to centralize data processing. It can be easily extended to custom data formats and offers a lot of plugins that can suit almost any need. Finally, Kibana is an analytics and visualization platform with an intuitive interface sitting on top of ElasticSearch. The fact that we'll use the ELK stack does not mean that it is better than the other solutions. It all depends on specific use cases and particular needs. I'll walk you through the principles of centralized logging and monitoring using the ELK stack. Once those principles are understood, you should have no problem applying them to a different stack if you choose to do so.
We switched the order of things and chose the tools before discussing the need for centralized logging. Let's remedy that.
The Need for Centralized Logging
In most cases, log messages are written to files. That is not to say that files are the only, or the most efficient, way of storing logs. However, since most teams are using file-based logs in one form or another, for the time being, I'll assume that is your case as well.
If we are lucky, there is one log file per service or application. However, more often than not, there are multiple files into which our services are outputting information. Most of the time, we do not care much what is written in logs. When things are working well, there is not much need to spend valuable time browsing through logs. A log is not a novel we read to pass the time, nor is it a technical book we spend time with as a way to improve our knowledge. Logs are there to provide valuable info when something, somewhere, goes wrong.
The situation seems to be simple. We write information to logs that we ignore most of the time, and when something goes wrong, we consult them and find the cause of the problem in no time. At least, that's what many are hoping for. The reality is far more complicated than that. In all but the most trivial systems, the debugging process is much more complex. Applications and services are, almost always, interconnected, and it is often not easy to know which one caused the problem. While it might manifest in one application, investigation often shows that the cause is in another. For example, a service might have failed to instantiate. After some time spent browsing its logs, we might discover that the cause is in the database. The service could not connect to it and failed to launch. We got the symptom, but not the cause. We need to switch to the database log to find it. With this simple example, we already got to the point where looking at one log is not enough.
With distributed services running on a cluster, the situation becomes far more complicated. Which instance of the service is failing? Which server is it running on? Which upstream services initiated the request? What are the memory and hard disk usage on the node where the culprit resides? As you might have guessed, finding, gathering, and filtering the information needed to discover the cause is often very complicated. The bigger the system, the harder it gets. Even with monolithic applications, things can easily get out of hand. If a (micro)services approach is adopted, those problems are multiplied. Centralized logging is a must for all but the simplest and smallest systems. Instead, many of us, when things go wrong, start running from one server to another, jumping from one file to the other, like a chicken with its head cut off, running around with no direction. We tend to accept the chaos logging creates and consider it part of our profession.
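To make the symptom-versus-cause problem a bit more concrete, here is a minimal Python sketch of what "running from one server to another" amounts to when done by hand: reading logs from several nodes and merging them into a single timeline. The host names, the service names, the log lines, and the `<ISO timestamp> <message>` line format are all invented for illustration. This is not how any particular tool does it, only a picture of the idea.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, List, NamedTuple


class Entry(NamedTuple):
    timestamp: datetime
    host: str
    service: str
    message: str


def read_entries(host: str, service: str, lines: Iterable[str]) -> Iterator[Entry]:
    """Parse lines of the assumed form '<ISO timestamp> <message>'."""
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        yield Entry(datetime.fromisoformat(stamp), host, service, message)


def merged_timeline(*streams: Iterable[Entry]) -> List[Entry]:
    """Merge streams that are each already ordered by time into one timeline."""
    return list(heapq.merge(*streams, key=lambda entry: entry.timestamp))


# Hypothetical log contents from two different nodes.
service_log = [
    "2016-03-01T10:15:02 Connecting to database",
    "2016-03-01T10:15:07 ERROR Could not connect to database, shutting down",
]
database_log = [
    "2016-03-01T10:14:58 ERROR Disk full, refusing new connections",
]

timeline = merged_timeline(
    read_entries("node-1", "books-service", service_log),
    read_entries("node-2", "database", database_log),
)

for entry in timeline:
    print(entry.timestamp.isoformat(), entry.host, entry.service, entry.message)
```

Once the entries sit in one place, ordered by time and tagged with host and service, the database error shows up right before the service failure. With two files that is easy to do by hand. With dozens of instances spread over many servers, it is not.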
What do we look for in centralized logging? As it happens, many things, but the most important are as follows.
- A way to parse data and send it to a central database in near real-time.
- The capacity of the database to handle near real-time data querying and analytics.
- A visual representation of the data through filtered tables, dashboards, and so on.
The ELK stack (ElasticSearch, LogStash, and Kibana) can do all that, and it can easily be extended to satisfy the particular needs we'll set in front of us.
Now that we have a vague idea of what we want to accomplish, and have the tools to do that, let us explore a few of the logging strategies we can use. We'll start with the most commonly used scenario and, slowly, move towards more complicated and more efficient ways to define our logging strategy.
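The three requirements listed above can be pictured with a toy Python sketch: a parser that turns raw lines into structured events, an in-memory store that can be filtered, and a small summary that stands in for a dashboard. Everything here is an assumption made for illustration: the log line format, the `CentralStore` class, the host and service names. None of it reflects how ElasticSearch, LogStash, or Kibana work internally.

```python
import re
from collections import Counter
from typing import Dict, List, Optional

# Assumed line format: "<timestamp> <LEVEL> <service>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def parse(line: str, host: str) -> Optional[Dict[str, str]]:
    """Turn a raw line into a structured event, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["host"] = host
    return event


class CentralStore:
    """Hypothetical in-memory stand-in for a central log database."""

    def __init__(self) -> None:
        self.events: List[Dict[str, str]] = []

    def index(self, event: Dict[str, str]) -> None:
        self.events.append(event)

    def search(self, **filters: str) -> List[Dict[str, str]]:
        """Return the events whose fields equal all of the given filters."""
        return [
            event for event in self.events
            if all(event.get(key) == value for key, value in filters.items())
        ]


def errors_per_service(store: CentralStore) -> Counter:
    """A tiny 'dashboard': the number of ERROR events per service."""
    return Counter(event["service"] for event in store.search(level="ERROR"))


# Hypothetical lines arriving from collectors on three nodes.
incoming = [
    ("node-1", "2016-03-01T10:15:07 ERROR books-service: Could not connect to database"),
    ("node-2", "2016-03-01T10:14:58 ERROR database: Disk full"),
    ("node-3", "2016-03-01T10:15:10 INFO books-service: Started"),
    ("node-3", "not a structured line"),
]

store = CentralStore()
for host, line in incoming:
    event = parse(line, host)
    if event is not None:
        store.index(event)

summary = errors_per_service(store)
for service, count in summary.most_common():
    print(f"{service:<15}{count}")
```

The point of the sketch is the separation of roles: parsing happens before storage, the store answers queries over structured fields rather than raw text, and the visual layer only consumes query results.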
The DevOps 2.0 Toolkit
This article was the beginning of the Centralized Logging and Monitoring chapter of The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices book.
This book is about different techniques that help us architect software in a better and more efficient way, with microservices packed as immutable containers, tested and deployed continuously to servers that are automatically provisioned with configuration management tools. It's about fast, reliable, and continuous deployments with zero downtime and the ability to roll back. It's about scaling to any number of servers, the design of self-healing systems capable of recuperation from both hardware and software failures, and centralized logging and monitoring of the cluster.
In other words, this book covers the whole microservices development and deployment lifecycle using some of the latest and greatest practices and tools. We'll use Docker, Kubernetes, Ansible, Ubuntu, Docker Swarm and Docker Compose, Consul, etcd, Registrator, confd, Jenkins, and so on. We'll go through many practices and even more tools.