This text was taken from the book and a Udemy course The DevOps Toolkit: Catalog, Patterns, And Blueprints
Should we use managed Containers as a Service (CaaS)? That must be the most crucial question we should try to answer. Unfortunately, it is hard to provide a universal answer since the solutions differ significantly from one provider to another. Currently (July 2020), CaaS can be described as the wild west, with solutions ranging from amazing to useless.
Before we attempt to answer the big question, let's go through some of the things we learned by exploring Google Cloud Run, AWS ECS with Fargate, and Azure Container Instances.
We can compare those three from different angles. One of those can be simplicity. After all, ease of use is one of the most essential benefits of serverless computing. It is supposed to allow engineers to provide code or binaries (in one form or another) with a reasonable expectation that the platform of choice will do most of the rest of the work.
From the simplicity perspective, both Google Cloud Run and Azure Container Instances are exceptional. They allow us to deploy our container images with almost no initial setup. Google needs only a project, while Azure requires only a resource group.
On the other hand, AWS needs over twenty different bits and pieces (resources) to be assembled before we can even start thinking about deploying something to ECS. Even after all the infrastructure is set up, we need to create a task definition, a service, and a container definition. If simplicity is what you're looking for, ECS is not it. It's horrifyingly complicated, and it's far from the "give us a container image, we'll take care of the rest" approach we are all looking for when switching to serverless deployments. Surprisingly, a company that provides such an amazing Functions as a Service solution (Lambda) did not do something similar with ECS. If AWS took the same approach with ECS as with Lambda, it would likely be the winner. But it didn't, so I am going to give it a huge negative point.
From the simplicity of setup and deployment perspective, Azure and Google are clear winners.
Now that we mentioned infrastructure in the context of the initial setup, we might want to take that as a criterion as well.
There is no infrastructure for us to manage when using CaaS in Google Cloud or Azure. They take care of all the details. AWS, on the other hand, forces us to create a full-blown cluster. That alone can disqualify AWS ECS with Fargate from being considered a serverless solution. I'm not even sure whether we could qualify it as Containers as a Service. As a matter of fact, I would prefer using Elastic Kubernetes Service (EKS). It's just as easy as ECS, if not easier, and, at least, it adheres to widely accepted standards and does not lock us into a suboptimal proprietary solution from which there is no escape.
How about scalability? Do our applications scale when deployed into managed Containers as a Service solutions? The answer to that question changes the rhythm of this story.
Google Cloud Run is scalable by design. It is based on Knative, which is a Kubernetes resource designed for serverless workloads. It scales without us even specifying anything. Unless we override the default behavior, it will create a replica of our application for every hundred concurrent requests. If there are no requests, no replicas will run. If the number of concurrent requests jumps to three hundred, it will scale to three replicas. It will queue requests if none of the replicas can handle them, and scale up and down to accommodate fluctuations in traffic. All that will happen without us providing any specific information. It has sane defaults while still providing the ability to fine-tune the behavior to match our particular needs.
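The default behavior described above comes down to a bit of arithmetic. What follows is a minimal sketch of that arithmetic, not the actual Cloud Run or Knative autoscaler. The function name and the `target_concurrency` parameter are hypothetical, and the value of one hundred is the one quoted in the text.

```python
import math


def replicas_for(concurrent_requests: int, target_concurrency: int = 100) -> int:
    """Approximate how many replicas a given load would need.

    Hypothetical helper: one replica per `target_concurrency` concurrent
    requests, and no replicas at all when there is no traffic.
    """
    if concurrent_requests <= 0:
        return 0  # scale to zero when nobody is using the app
    return math.ceil(concurrent_requests / target_concurrency)


# A few illustrative load levels (made-up numbers).
for load in (0, 50, 300, 301):
    print(load, "concurrent requests ->", replicas_for(load), "replica(s)")
```

Rounding up is the important design choice here: a single request above the threshold is enough to justify another replica, while zero requests justify none.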
Applications deployed to ECS are scalable as well. But it is not easy.
Scaling applications deployed to ECS is complicated and limiting. Even if we can overlook those issues, it does not scale to zero replicas. At least one replica of our application needs to run at all times since there is no built-in mechanism to queue requests and spin up new replicas. From that perspective, scaling applications in ECS is not what we would expect from serverless computing. It is similar to what we would get from HorizontalPodAutoscaler in Kubernetes. It can go up and down, but never to zero replicas. Given that there is a scaling mechanism of sorts, but that it cannot go down to zero replicas and that it is limiting in what it can actually do, I can only say that ECS only partially fulfills the scalability needs of our applications, at least in the context of serverless computing.
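The comparison with HorizontalPodAutoscaler can be made concrete. The sketch below follows the formula documented for the Kubernetes HPA (desired replicas = ceil(current replicas * current metric / target metric)) and then clamps the result between a minimum and a maximum. It does not describe how ECS implements scaling. The function name and the default bounds are assumptions made for illustration, with the lower bound of one replica reflecting the "never to zero" behavior described above.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,   # assumed floor: at least one replica always runs
    max_replicas: int = 10,  # assumed ceiling, chosen only for this example
) -> int:
    """Return the replica count an HPA-style autoscaler would aim for."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds; the floor is what prevents scale-to-zero.
    return max(min_replicas, min(max_replicas, desired))


# Even with no load at all, the floor keeps one replica running.
print(hpa_desired_replicas(current_replicas=3, current_metric=0, target_metric=60))
```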
How about Azure Container Instances?
Unlike Google Cloud Run and ECS, it does not use a scheduler. There is no scaling of any kind. All we can do is run single replica containers isolated from each other. That alone means that Azure Container Instances cannot be used in production for anything but small businesses. Even in those cases, it is still not a good idea to use ACI for production workloads. The only use-case I can imagine would be for situations in which your application cannot scale. If you have one of those old, often stateful applications that can run only in single-replica mode, you might consider Azure Container Instances. For anything else, the inability to scale is a show stopper.
Simply put, Azure Container Instances provide a way to run Docker containers in the cloud. There is not much more to it, and we know that Docker alone is not enough for anything but development purposes.
I would say that even development with Docker alone is not a good idea, but that would open a discussion that I want to leave for another time.
Another potentially important criterion is the level of lock-in. ECS (with or without Fargate) is fully proprietary and forces us to rely entirely on AWS. The number of resources we need to create and the format for writing application definitions ensure that we are locked into AWS. If you choose to use it, you will not be able to move anywhere else, at least not easily. That does not necessarily mean that the benefits do not outweigh the potential cost of being locked in, but rather that we need to be aware of it when deciding whether to use it.
The issue with ECS is not lock-in itself. There is nothing wrong with using proprietary solutions that solve problems in a better way than open alternatives. The problem is that ECS is by no means any better than Kubernetes. As a matter of fact, it is a worse solution. So, the problem with being locked into ECS is that you are locked into a service that is not as good as the more open counterpart provided by the same company (AWS EKS). That does not mean that EKS is the best managed Kubernetes service (it is not), but that, within the AWS ecosystem, it is probably a better choice.
Azure Container Instances are also fully proprietary but, given that all the investment is in creating container images and running a single command, you will not be locked in. The investment is very low, so if you choose to switch to another solution or a different provider, you should be able to do that with relative ease.
Google Cloud Run is based on Knative, which is open source and an open standard. Google is only providing a layer on top of it. You can even deploy to it using Knative definitions, which can be applied to any Kubernetes cluster with Knative installed. From the lock-in perspective, there is close to none.
How about high-availability?
Google Cloud Run was the only solution that did not produce 100% availability in our tests with siege. So far, that is the first negative point we could give it. That is a severe downside. That does not mean that it is not highly available, but rather that it tends to produce only a few nines after the decimal (e.g., 99.99%). That's not a bad result by any means. If we did more serious testing, we would see that over a more extended period and with a higher number of requests, the other solutions would also drop below 100% availability. Nevertheless, with a smaller sample, Azure Container Instances and AWS ECS did produce better results than Google Cloud Run, and that is not something we should ignore.
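To put the "few nines" figure into perspective, availability in a load test is simply the share of requests that succeeded. The sketch below shows that calculation. The function name is hypothetical, and the request counts are made-up illustrations rather than the numbers measured in the tests.

```python
def availability(successful: int, failed: int) -> float:
    """Return the percentage of requests that succeeded."""
    total = successful + failed
    if total == 0:
        raise ValueError("no requests were made")
    return 100 * successful / total


# Illustrative numbers only: 10 failures out of 100,000 requests.
print(f"{availability(99_990, 10):.2f}%")   # 99.99%
# A small sample with no failures reports a perfect score.
print(f"{availability(1_000, 0):.2f}%")     # 100.00%
```

The second call hints at why a short test can be misleading: with only a thousand requests, a failure rate of one in ten thousand would most likely go unnoticed.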
Azure Container Instances, on the other hand, can handle only limited traffic. The inability to scale horizontally inevitably leads to a failure to be highly available. We did not experience that in our tests with siege, mostly because a single replica was able to handle a thousand concurrent requests. If we increased the load, it would start collapsing once it reached the limit of what one replica can handle.
On the other hand, ECS provides the highest availability, as long as we set up horizontal scaling. We need to work for it.
Finally, the most important question to answer is whether any of those services is production-ready.
We already saw that Azure Container Instances should not be used in production, except for very specific use-cases.
Google Cloud Run and AWS ECS, on the other hand, are production-ready. Both provide all the features you might need when running production workloads. The significant difference is that ECS has existed for much longer, while Google Cloud Run is a relatively new service, at least at the time of this writing (July 2020). Nevertheless, it is based on Google Kubernetes Engine (GKE), which is considered the most mature and stable managed Kubernetes we can use today. Given that Google Cloud Run is only a layer on top of GKE, we can safely assume that it is stable enough. The bigger potential problem is in Knative itself. It is a relatively new project that has not yet reached its first GA release (at the time of this writing, the latest release is 0.16.0). Nevertheless, major software vendors are behind it. Even though it might not yet be battle-tested, it is getting very close to being the preferable way to run serverless computing in Kubernetes.
To summarize, Azure Container Instances are not, and never will be, production-ready. AWS ECS is fully there, and Google Cloud Run is very close to being production-ready.
Finally, can any of those services be qualified as serverless? To answer that question, let's define the features we expect from managed serverless computing.
It is supposed to remove the need to manage infrastructure or, at least, to simplify it greatly. It should provide scalability and high-availability, and it should charge us for what our users use while making sure that our apps are running only when needed. We can summarize those as follows.
- No need to manage infrastructure
- Out-of-the-box scalability and high-availability
- "Pay what your users use" model
If we take those three as the basis for evaluating whether something is serverless or not, we can easily discard both Azure Container Instances and AWS ECS with Fargate.
Azure Container Instances service does not have out-of-the-box scalability and high-availability. As a matter of fact, it has no scalability of any kind and, therefore, it cannot be highly available. On top of that, we do not pay what our users use since it cannot scale to zero replicas, so our app is always running, no matter whether someone is consuming it. As such, our bill will be based on the amount of pre-assigned resources (memory and CPU). The only major serverless computing feature that it does provide is hands-off infrastructure.
AWS ECS with Fargate does provide some sort of scalability. It's not necessarily an out-of-the-box experience, but it is there. Nevertheless, it fails to abstract infrastructure management, and it suffers from the same problem as ACI where billing is concerned. Given that we cannot scale our applications to zero replicas when they are not used, we have to pay for the resources they are consuming independently of our users' needs.
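The billing consequence of not scaling to zero can be illustrated with a small calculation. This is a rough sketch that counts replica-hours only and ignores real pricing models, rounding rules, and resource sizes. The function name and the hourly demand figures are made up for the example.

```python
from typing import Iterable


def billable_replica_hours(demand_per_hour: Iterable[int], min_replicas: int = 0) -> int:
    """Sum the replica-hours we would pay for.

    `demand_per_hour` holds the number of replicas the traffic needs in each
    hour; `min_replicas` is the floor the platform keeps running regardless.
    """
    return sum(max(min_replicas, needed) for needed in demand_per_hour)


# Hypothetical day fragment: traffic only during two of six hours.
demand = [0, 0, 3, 1, 0, 0]

print(billable_replica_hours(demand, min_replicas=0))  # scale-to-zero: 4
print(billable_replica_hours(demand, min_replicas=1))  # always-on floor: 8
```

With a floor of one replica, the idle hours are billed even though no user consumed anything during them.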
Google Cloud Run is, by all accounts, a serverless implementation of Containers as a Service. It removes the need to manage infrastructure, and it provides horizontal scaling as an out-of-the-box solution while still allowing us to fine-tune the behavior. It scales to zero when not in use, so it does adhere to the "pay what your users use" model. Google Cloud Run is, without doubt, the best of the three.
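The evaluation above can be written down as a simple checklist. The sketch below encodes the three criteria and the conclusions reached in the text. The class and field names are hypothetical, and the boolean values are this chapter's judgments rather than vendor-published facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    no_infrastructure_to_manage: bool
    scales_out_of_the_box: bool
    pay_what_users_use: bool

    def is_serverless(self) -> bool:
        # A service qualifies only if it meets all three criteria.
        return all(
            (
                self.no_infrastructure_to_manage,
                self.scales_out_of_the_box,
                self.pay_what_users_use,
            )
        )


SERVICES = [
    # Hands-off infrastructure, but no scaling and no scale-to-zero billing.
    Service("Azure Container Instances", True, False, False),
    # Scaling exists but is not out-of-the-box; a cluster must be managed.
    Service("AWS ECS with Fargate", False, False, False),
    # Meets all three criteria according to the text.
    Service("Google Cloud Run", True, True, True),
]

for service in SERVICES:
    verdict = "serverless" if service.is_serverless() else "not serverless"
    print(f"{service.name}: {verdict}")
```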
Here's a table that summarizes the findings.
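| | Google Cloud Run | AWS ECS with Fargate | Azure Container Instances |
|---|---|---|---|
| Easy to use | Yes | No | Yes |
| Open (no lock-in) | Yes | No | Yes |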