I am continuously getting questions about blue-green releases inside a Docker Swarm cluster. Viktor, in your book The DevOps 2.0 Toolkit you told us to use blue-green deployment. How do we do it with services running inside a Swarm cluster? My answer is usually something along the following lines. With the old Swarm, blue-green releases were easier than rolling updates (neither was supported out of the box). Now we’ve got rolling updates. Use them! The reaction to that is often that we still want blue-green releases.
This post is my brainstorming on this subject. I did not write it as a result of some deep thinking. There is no great wisdom in it. I just wrote what was passing through my mind while I was trying to answer another one of the emails containing blue-green deployment questions. What follows might not make much sense. Don’t be harsh on me.
Rolling updates through the docker service update command provide zero-downtime deployments out of the box. They work great, we like them, and we were even willing to ignore the bugs they contained in early releases. Actually, we were not ignoring bugs. The first release with Swarm Mode (v1.12.0) had too many of them. But it was the first release of a completely new product, so we opened a lot of issues and hoped for the best. Indeed, Docker folks responded quickly and fixed many in the next release, most in the one after that, and so on. A few releases later, rolling updates worked like a charm.
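For context, such a rollout might look like the following sketch (the service name, image tags, and numbers are invented for illustration):

```shell
# Create a service with an update policy baked in: replace one replica
# at a time, waiting 10 seconds between batches (all values hypothetical).
docker service create \
  --name go-demo \
  --replicas 5 \
  --update-parallelism 1 \
  --update-delay 10s \
  my-registry/go-demo:1.0

# Roll out a new image with zero downtime.
docker service update \
  --image my-registry/go-demo:1.1 \
  go-demo

# Observe the progress of the rollout.
docker service ps go-demo
```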
Going back to blue-green deployment… We want it. We want it. We want it. Why? We’re not sure, but we want it now. OK. That was enough from my side. I should not mock others! I’m being nasty and should express myself in a more politically correct way. Even though I live in Spain, where political correctness has still not reached the mainstream, I should know better. Ignore what I said.
The only challenge with rolling updates (as far as I can think of) is how to test a new release before it becomes available to the public. The real question is not whether we should test deployment to production but what should be tested in production. Each environment should have its set of tests executed after deployment, and production should not be an exception.
How do others test their releases in production? Now that we have zero-downtime deployments through rolling updates, do we need blue-green as a way to test a new release in production?
We could deploy canary releases instead. It’s a practice used in quite a lot of companies. A short pause. This is one of those moments when I catch myself saying a lie. Canary releases are not used in quite a lot of companies. The majority of our industry lives around the year 1999 (especially big enterprises). Most are still struggling to understand that there is something called cloud computing, many are still using WebSphere, automated testing is an unreachable dream, and so on, and so forth. Heck, Cobol is still a thing, and many did not reach further than the “we will deploy it during the weekend because there will be downtime and we do not know whether it will work” approach to releasing software to production. So, I take it back. Not many companies are using canary releases, but those that know what they’re doing (and that’s a minority) have adopted them. A short pause while my mind recuperates from another reset. I got off topic, again.
In a nutshell, a canary release would start by updating some of the replicas, test whether the update works as expected, increase the number of replicas running the new release, test some more, and so on. The process is repeated until all the replicas run the new release. Throughout the process, we would observe logs, monitor some metrics, and whatnot. Even though there are a few things left to be desired (see issue 26160), we can do canary releases with the docker service update command.
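One way to approximate a canary with that command is to combine a small batch size with a very long delay, so only a few replicas run the new release while we observe them. A sketch, with invented names and numbers:

```shell
# Update replicas one at a time, pausing an hour between batches so the
# new release can be observed in production before it spreads further.
docker service update \
  --update-parallelism 1 \
  --update-delay 1h \
  --image my-registry/go-demo:1.1 \
  go-demo

# See which replicas run which image while the canary bakes.
docker service ps go-demo
```

If the metrics look bad during the delay, the service can be updated back to the previous image instead of letting the rollout continue.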
My understanding is that the challenge might be in testing canary releases. Personally, I don’t worry whether a new feature works as expected from the functional point of view. Since containers are immutable (at least when done right), I am very confident that what I tested in the integration environment is the same as what I’ll deploy to production. Immutability vastly reduces the things we need to verify when running a new release in production. Performance should not be an issue either, because it should have been tested before deploying a new release to production. That does not mean we should not monitor production and make sure that each release is performing as expected. Quite the contrary. The thing is that blue-green deployment does not help us with that. It allows us to run our performance tests, but that’s what testing environments are for. What it does not allow us to do (and canary releases do) is test (monitor) performance while “real” users use a new release, gradually increasing the reach of that release.
So, we do not need to test new features in production, nor should we run performance tests in production (excluding monitoring, which is testing in a way). The only thing I can think of as a missing validation is whether users find a new feature useful. For that, blue-green does not help much, since users cannot see the release until it is available to the public. Canary releases handle that much better since we can monitor things gradually and, depending on metrics (e.g., do they click the new button), decide whether to continue increasing the number of replicas updated with the new release. If canary releases are not your thing, there are feature toggles. Or just use both…
What I’m trying to say is that I’m not sure how useful it would be to have blue-green deployment, considering that zero-downtime deployments are provided out of the box through rolling updates. Through the immutability that containers give us, we can be confident that what runs in production is exactly the same as what was tested. So, why do we want blue-green releases? I can come up with only a few possible answers to that question.
- We want it because we are used to it. We are trying to apply logic we used before we adopted containers and started orchestrating our cluster with Docker Swarm (Mode).
- We do not know that there is such a thing as rolling updates baked into Docker engine.
- We are not confident in immutability. We do not believe that what was tested in other environments will be the same as what we’ll deploy to production.
None of those answers is valid. Our industry is changing very fast, and we cannot allow ourselves to do something only because we’re used to it. Docker Swarm (and rolling updates) were introduced half a year ago, so there’s no excuse for not knowing they exist. If our container adoption did not provide immutability, the chances are that there is something wrong with our system’s architecture. Containers require changes on many levels, and if we are not ready to implement those changes, maybe we should not adopt containers just yet.
I’m not trying to say that there is no use case for blue-green deployment with Swarm Mode. I just have a hard time seeing it myself. If it came out of the box, sure, why not? We could use it in some cases and switch to rolling updates in others. But it is not included. We would need to build custom logic to support it, and I am failing to find good enough arguments for such an effort.
Quite a few of you are asking me for blue-green releases. Is it only because I advocated in their favor while we were using the old Swarm? If that’s the case, that Swarm is gone, and many of the assumptions we had are gone with it.
I would like to get your opinion. Do you want to use blue-green deployments with Swarm Mode, or have you switched to rolling updates? If you do want them, would it make sense to create a project that enables blue-green releases in Swarm Mode, or would you prefer custom every-man-for-himself scripts?
Join us in the DevOps20 Slack Channel and let’s discuss your use case.
The DevOps 2.1 Toolkit: Docker Swarm
If you liked this article, you might be interested in The DevOps 2.1 Toolkit: Docker Swarm book. Unlike the previous title in the series (The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices), which provided a general overview of some of the latest DevOps practices and tools, this book is dedicated entirely to Docker Swarm and the processes and tools we might need to build, test, deploy, and monitor services running inside a cluster.
You can get a copy from Amazon.com (and the other worldwide sites) or LeanPub. It is also available as The DevOps Toolkit Series bundle.
Give the book a try and let me know what you think.
Why not have both? One use case could be “I want 20% of my users to see the new release”. Something like https://github.com/stevvooe/sillyproxy but I’m trying to set this up with Swarm and nginx.
That scenario is easy. Just do a rolling update with a 20% increment and a very long delay.
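As a sketch (replica counts and names are hypothetical): with ten replicas, a 20% increment translates into batches of two.

```shell
# 20% of 10 replicas = batches of 2, with a long pause between batches,
# so roughly 20% of users see the new release during the first window.
docker service update \
  --update-parallelism 2 \
  --update-delay 4h \
  --image my-registry/go-demo:1.1 \
  go-demo
```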
I totally agree with Farcic. And with part of the replicas updated to a new version, the logs can identify the behavior of the new version being applied to production.
Why do we want blue-green releases? To support sticky sessions. We need to spin up a ‘green’ container containing our new code which accepts new sessions while keeping requests associated with existing sessions going to the ‘blue’ container. We would then like to shut down the ‘blue’ container when either a) it has no more active sessions or b) when some time limit is reached.
Ideally all our applications would be stateless or keep session state in a distributed cache, but like many folks we have legacy applications and our client will not pay us to make those kind of changes.
You’re right. I would like to get more info about your use case and plans. Are you using Swarm? If you are, how are you bypassing Docker’s new networking? It load balances requests, making sticky sessions (almost) impossible. I guess you are disabling the Overlay network, but then you are removing probably the most interesting features introduced with Swarm Mode.
I would really appreciate if you could share your implementation within Swarm and/or your future plans.
We will be moving our applications over to Docker / Swarm during 2017. At this point, unless an upcoming version of Swarm / Services supports sticky sessions, we are looking at not scaling our services. Thus, we will have the App_A_blue service with scale = 1, which has two containers: App_A_Nginx and App_A_Tomcat. During code deployments, we would spin up a new App_A_green service with our new versions of HTML content and Java classes / property files in the two containers. In front of this would be HAProxy, which would handle sending requests for existing sessions to App_A_blue while sending new requests to App_A_green. Eventually, HAProxy would direct all traffic to App_A_green. This will most likely be controlled via Ansible.
We lose some of the benefits of Docker Services, but we do keep the feature that if one of our containers goes down Docker will restart a new instance and then direct traffic to it.
We could look at something like Kubernetes, but at this point we plan to stick with Docker provided tools. Hopefully a future version of Docker / Swarm / Services will ‘natively’ support sticky sessions.
You’re right. The plan of deploying each instance as a separate service is probably the best course of action with sticky sessions. Doing BG should not be a problem in that case. You can “hard-code” service names in your proxy and just make sure to switch from one set to another.
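A rough sketch of that setup (all names are invented; the proxy configuration itself is omitted):

```shell
# Run each color as a separate service on a shared overlay network so the
# proxy can reach them by their (hard-coded) service names.
docker network create --driver overlay proxy

docker service create --name app-blue  --network proxy my-registry/app:1.0
docker service create --name app-green --network proxy my-registry/app:1.1

# The proxy (e.g., HAProxy) sends requests with existing sessions to
# app-blue and new sessions to app-green. Once app-blue has drained:
docker service rm app-blue
```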
IMO, B/G deployments are useful for:
- Testing different apps (UX, GFX, functionalities…). This requires sticking users to a version of the application and closely monitoring their behaviour (do they click more if the button is round or if it’s square?). That could be addressed with the 20% rolling deployment, but it requires at least sticky sessions.
- Routing users to different API versions. Think of a mobile app where some users deploy the 2.x app on their phone and thus need to talk to the 2.x API. You have to run API 1.x in parallel for legacy users, and 2.x for up-to-date users. This requires a clever proxy that routes depending on some HTTP headers.
Actually, both cases would be well handled by a frontend proxy/router that routes depending on a specially defined policy. And that seems out of the scope of Docker.
By the way: I loved the «brainstorming» writeup. I hate writing such articles because I always feel my ideas are too much of a mess. But you brilliantly managed it, and I enjoyed reading your article. Maybe your ideas are better ordered than mine! 😉
You’re right. There are use cases for BG. My understanding is that most of them revolve around sticky sessions, which are very hard with Swarm. Using sticky sessions means not using Swarm’s networking, thus removing probably the main benefits it gives us. It’s a hard choice. Some might want to wait a while longer before adopting Swarm, while others might choose to get rid of sticky sessions (Redis?). I think that the main problem with sticky sessions is not whether we use BG or rolling updates, but Swarm networking.
As for “Routing users to different API versions”… I think the solution to this problem is to version APIs within the same service or to run different versions as separate services in parallel. BG allows us to deploy without downtime. It does not solve the problem of API versioning.
“This requires a clever proxy that routes depending on some HTTP headers.” This is true, but not really related to deployment mechanisms.
“Actually, both cases would be well handled by a frontend proxy/router that routes depending on a specially defined policy. And that seems out of the scope of Docker.” Now we’re getting somewhere :). If you submit an issue to https://github.com/vfarcic/docker-flow-proxy, I’ll be more than happy to add whatever is needed to make routing based on HTTP headers possible.
“By the way: I loved the «brainstorming» writeup. I hate writing such articles because I always feel my ideas are too much of a mess. But you brilliantly managed it, and I enjoyed reading your article. Maybe your ideas are better ordered than mine! 😉”
Thank you for this. I’ll make sure to start recording my “brainstorming while showering” sessions 🙂
Thanks a lot for your email!
My dearest Viktor, we showed a really nice example of this situation at the Dublin Docker Meetup, with the people from @Calico, @Istio, and @Envoy running on @Kubernetes. It is definitely something to take a look at, and who knows, maybe it could apply to Docker Swarm in some way. It is a new piece that gives a lot more power than proxy or routing rules (as it covers different network layers, not just the app layer).
In fact, this is at version 0.0.1 or something like that, but it is promising! We showed those rules being applied in real time, with real traffic, without any problems!
Looking forward to meeting you again, my non-Catalan friend!
To add another possible reason for B-G deployment: data formats/schema.
When deploying a new app using a new data format or schema, you need to ensure that older versions of that app can at least read data in the new format and maybe even write changes without breaking stuff. The only way known to me to reduce such issues is to use an event store with no (or at least minimal) schema.
That can be done in DBs both with and without schemas. The trick is to always make changes that are backwards compatible. If iterations are short, that is usually not a problem, assuming that one of the tools that manage DB schemas is used as part of the CD pipeline. No matter whether BG or rolling updates are used, there will be a period when both releases coexist. I do not think it makes a difference whether that time is a millisecond or longer. DB changes need to be backwards compatible if zero-downtime deployments are to be applied.