Tag Archives: Prometheus

Post-Mortem Documents About Production Issues With Fiberplane

What do we do when we find an issue in production? Fix it? How about generating a post-mortem report while fixing it?

That’s what Fiberplane is all about. In this video, I’ll show you how to use Fiberplane to generate post-mortem documents about incidents in production Kubernetes clusters.

Continue reading

Using Prometheus To Measure Error Rate Of Requests Managed By Istio

The video below is a clip from the "Canary Deployments To Kubernetes Using Istio and Friends" course on Udemy. It provides a high-level explanation of how a service mesh works. Additional preview clips are available inside the course. Please use the discount coupons provided below.
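As a taste of what the course covers, here is a minimal sketch of a Prometheus recording rule that computes the error rate of requests reported by Istio sidecars. The group name, the rule name, and the five-minute window are illustrative assumptions rather than material taken from the course.

```yaml
# istio-error-rate.yaml - a hypothetical Prometheus rule file (not from the course)
groups:
  - name: istio-error-rate
    rules:
      # Ratio of 5xx responses to all requests, as reported by the destination sidecars
      - record: workload:istio_request_error_rate:ratio_rate5m
        expr: |
          sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
          /
          sum(rate(istio_requests_total{reporter="destination"}[5m]))
```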

If you do enrol in the course, please let us know what you think and do NOT forget to rate it.

$9.99 coupon available until January 28, 2020: https://www.udemy.com/course/canary-deployments-to-kubernetes-using-istio-and-friends/?couponCode=B0B4A264C420C19F58D5

$13.99 coupon available until January 27, 2020: https://www.udemy.com/course/canary-deployments-to-kubernetes-using-istio-and-friends/?couponCode=7F311AD2C040117054AB

Coupon with whatever is Udemy’s best price: https://www.udemy.com/course/canary-deployments-to-kubernetes-using-istio-and-friends/?referralCode=75549ECDBC41B27D94C4

Visualizing Kubernetes Metrics And Alerts

Dashboards are useless! They are a waste of time. Get Netflix if you want to watch something. It’s cheaper than any other option.

I repeated those words on many public occasions. I think that companies exaggerate the need for dashboards. They spend a lot of effort creating a bunch of graphs and put a lot of people in charge of staring at them. As if that’s going to help anyone. The main advantage of dashboards is that they are colorful and full of lines, boxes, and labels. Those properties are always an easy sell to decision makers like CTOs and heads of departments. When a software vendor comes to a meeting with decision makers who have the authority to write checks, he knows that there is no sale without “pretty colors”. It does not matter what that software does, only how it looks. That’s why every software company focuses on dashboards.

Think about it. What is a dashboard good for? Are we going to look at graphs until a bar reaches a red line indicating that a critical threshold has been reached? If that’s the case, why not create an alert that will trigger under the same conditions, and stop wasting time staring at screens waiting for something to happen? Instead, we can be doing something more useful (like staring at Netflix).
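Purely as an illustration of that argument, here is a minimal sketch of such an alert expressed as a Prometheus alerting rule. The metric name, the threshold, and the labels are assumptions, not part of any specific setup.

```yaml
# alerts.yaml - a hypothetical rule that replaces the "stare at the graph" workflow
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests returned a 5xx response over the last five minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Error rate has been above 5% for five minutes
```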
Continue reading

A Quick Introduction To Prometheus And Alertmanager

Kubernetes HorizontalPodAutoscaler (HPA) and Cluster Autoscaler (CA) provide essential, yet rudimentary, mechanisms to scale our Pods and clusters. While they handle scaling decently well, they do not solve our need to be alerted when something is wrong, nor do they provide the information required to find the cause of an issue. We’ll need to expand our setup with additional tools that will allow us to store and query metrics, as well as to receive notifications when there is an issue.
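For reference, this is roughly what one of those built-in mechanisms looks like: a minimal HorizontalPodAutoscaler sketch that scales a hypothetical api Deployment on CPU utilization, assuming a reasonably recent Kubernetes version (the autoscaling/v2 API). The names and numbers are made up for illustration.

```yaml
# hpa.yaml - an illustrative HPA definition (names and thresholds are assumptions)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```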

If we focus on tools that we can install and manage ourselves, there is very little doubt about what to use. If we look at the list of Cloud Native Computing Foundation (CNCF) projects, only two have graduated so far (October 2018). Those are Kubernetes and Prometheus. Given that we are looking for a tool that will allow us to store and query metrics, and that Prometheus fulfills that need, the choice is straightforward. That is not to say that there are no other similar tools worth considering. There are, but they are all service-based. We might explore them later but, for now, we’re focused on those that we can run inside our cluster. So, we’ll add Prometheus to the mix and try to answer a simple question: what is Prometheus?
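To give a flavour of what operating Prometheus ourselves involves, here is a minimal, illustrative scrape configuration that discovers annotated Pods in a Kubernetes cluster. It is a sketch under those assumptions, not necessarily the configuration used in the chapters that follow.

```yaml
# prometheus.yml - a minimal, illustrative configuration
global:
  scrape_interval: 15s
scrape_configs:
  # Scrape Prometheus itself
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Discover and scrape Pods annotated with prometheus.io/scrape: "true"
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```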
Continue reading

Blueprint Of A Self-Sufficient Docker Cluster

The article that follows is an extract from the last chapter of The DevOps 2.2 Toolkit: Self-Sufficient Docker Clusters book. It provides a good summary of the processes and tools we explored in the quest to build a self-sufficient cluster that can (mostly) operate without humans.

We split the tasks that a self-sufficient system should perform into those related to services and those oriented towards infrastructure. Even though some of the tools are used in both groups, the division between the two allowed us to keep a clean separation between infrastructure and services running on top of it.
Continue reading

Auto-Scaling Docker Swarm Services Using Instrumented Metrics

If you prefer reading, please visit the Auto-Scaling Docker Swarm Services Using Instrumented Metrics tutorial in the Docker Flow Monitor documentation. Otherwise, feel free to watch the video that follows.

The DevOps 2.2 Toolkit: Self-Sufficient Docker Clusters

The article you just read is a summary of the progress we made in The DevOps 2.2 Toolkit: Self-Sufficient Docker Clusters. You can find the hands-on exercises that build the system in the book.

If you liked this article, you might be interested in The DevOps 2.2 Toolkit: Self-Sufficient Docker Clusters book. The book goes beyond Docker and schedulers and tries to explore ways of building self-adaptive and self-healing Docker clusters. If you are a Docker user and want to explore advanced techniques for creating clusters and managing services, this book might be just what you’re looking for.

Please get a copy from Amazon or LeanPub, or look for it through your favorite bookseller.

Give the book a try and let me know what you think.

Collecting Metrics and Monitoring Docker Swarm Clusters

In the Forwarding Logs From All Containers Running Anywhere Inside A Docker Swarm Cluster article, we managed to add centralized logging to our cluster. Logs from any container running inside any of the nodes are shipped to a central location. They are stored in ElasticSearch and available through Kibana. However, the fact that we have easy access to all the logs does not mean that we have all the information we would need to debug a problem or prevent it from happening in the first place. We need to complement our logs with the rest of the information about the system. We need much more than what logs alone can provide.
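To make that goal a bit more concrete, here is a minimal sketch of a Swarm stack that adds metrics collection to the cluster. The images and the layout are illustrative assumptions, not the exact stack built in the article.

```yaml
# monitor-stack.yml - an illustrative Swarm stack for metrics collection
# Deploy with: docker stack deploy -c monitor-stack.yml monitor
version: "3.7"
services:
  prometheus:
    # Stores and queries metrics; a real setup would mount a configuration
    # pointing it at the exporters and services it should scrape.
    image: prom/prometheus
    ports:
      - 9090:9090
    deploy:
      replicas: 1
  node-exporter:
    # Runs on every node so Prometheus can collect host-level metrics
    image: prom/node-exporter
    deploy:
      mode: global
```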
Continue reading