Gunicorn in Containers
There is a dearth of information on how to run a Flask/Django app on Kubernetes using gunicorn (mostly for good reason!). What information is available is often conflicting and confusing. Based on issues I’ve seen with my customers at Defiance Digital in the last year or so, I developed a test repository to experiment with different configurations and see which is best.
tl;dr The conventional wisdom of running multiple workers in a containerized Flask/Django app (or anything else served by gunicorn) is incorrect - you should only run one or two workers per container, otherwise you’re not properly using the resources allocated to your application. Multiple workers per container also run the risk of OOM SIGKILLs without logging, making diagnosis of issues much more difficult than it needs to be.
The Issues
We currently have three customers using Django in containers, two of which run those containers in Kubernetes.
The two customers in Kubernetes would frequently run into gunicorn Worker Timeout errors, causing their applications to block and fail regularly. Their developers spent a bit of time investigating but, as often happens with this issue, the logs had no helpful information. They came to us to help discover what the problem was.
Each customer was following gunicorn’s documented suggestion for the number of workers - (2 x $num_cores) + 1. This is a sensible place to start, especially considering the docs claim that you only need 4-12 workers to handle hundreds or thousands of requests per second. Our customers are startups and were receiving a maximum of 400-500 requests per second, so in theory even a single worker should be able to meet their traffic needs. Note, however, that the documentation does not recommend setting your worker count based on projected traffic.
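For reference, that guidance usually ends up in a gunicorn.conf.py along these lines (a generic sketch, not either customer’s actual configuration):

# gunicorn.conf.py - the conventional (2 x $num_cores) + 1 starting point
import multiprocessing

# Note: inside a container this typically reports the host's CPU count,
# not the container's CPU limit.
workers = multiprocessing.cpu_count() * 2 + 1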
The gunicorn FAQ has a section on why workers are silently killed. Here, it suggests that when you have workers that are killed without logs, it’s due to the OS-level OOM killer. At first we dismissed this as the root cause because our monitoring showed we almost never used more than 50% of the available memory in a given container. However, upon further investigation, we couldn’t find any other reason for workers silently dying.
The Results
Armed with the working theory that memory utilization was causing the problems, I came up with a further hypothesis that the worker processes were splitting the available resources in the container. This would explain why we saw OOM kills but utilization never reached above 50% - if one worker hit its max capacity in a spike of utilization, that would be enough to trigger an OOM kill without utilizing the container’s full resources.
Using a Docker Desktop Kubernetes cluster on my work MacBook, I wrote a repeatable test and shared it as an open source repository: https://github.com/Defiance-Digital/gunicorn-k8s-test. This test confirmed my theory - the workers split the container’s resources evenly among themselves!
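The real test lives in the repository linked above; the sketch below is only a hypothetical illustration of its general shape - a trivial Flask endpoint that allocates a burst of memory, run under gunicorn with several workers inside a memory-limited container:

# app.py - hypothetical illustration only; see the linked repo for the real test
from flask import Flask

app = Flask(__name__)

@app.route("/allocate")
def allocate():
    # Hold roughly 200 MB for the duration of the request to simulate a spike.
    blob = bytearray(200 * 1024 * 1024)
    return f"allocated {len(blob)} bytes\n"

# Run with several workers in a container limited to, say, 512 MB of memory:
#   gunicorn --workers 5 --bind 0.0.0.0:8000 app:app
# A few concurrent hits to /allocate will get a worker OOM-killed well before
# container-level memory utilization ever looks close to 100%.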
We changed our configuration to run only a single worker per container - the container itself already acts as a discrete worker - and immediately saw each container’s resource utilization increase. We also noted that the spikes causing worker OOM kills did, in fact, reach the expected levels in our monitoring tool. As a side benefit, we were able to decrease the number of containers we were running while increasing the performance of each one, leading to a net cost savings as we no longer needed as many host nodes.
Ultimately, while this change decreased the number of Worker Timeout errors in our customer applications, it did not solve the problem entirely. The applications were making inefficient SQL queries that took longer than the configured timeout, which caused silent failures. After solving both issues the number of Worker Timeout errors in these customer applications dropped to zero.
Additional Lessons
Over time, we discovered that there were two underlying issues we continued to see with our customers:
- The load balancers sending traffic to our pods or tasks had no way of knowing how many workers might be available in a given pod, since the master gunicorn process manages that. This meant any time a synchronous task was still running, the master process might be accumulating a backlog of other requests, including health checks.
- With only one worker per pod/task, a single long-running request tied up the entire container, leaving no way to handle even simple requests (such as health checks) in the meantime - see the sketch below.
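A hypothetical sketch of that failure mode, assuming a default sync worker and a typical /healthz probe endpoint:

# app.py - hypothetical sketch of a single sync worker blocking everything else
import time

from flask import Flask

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    return "ok\n"

@app.route("/report")
def report():
    time.sleep(20)  # stand-in for a slow SQL query or flaky external API call
    return "done\n"

# gunicorn --workers 1 --timeout 30 --bind 0.0.0.0:8000 app:app
# While /report is in flight, a /healthz probe simply queues behind it; if the
# probe times out first, the orchestrator can restart an otherwise healthy pod.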
After further reading of the docs, we discovered that gunicorn suggests using asynchronous workers for any task exceeding 500ms. The applications were running long database queries and calling external APIs that periodically failed, and those requests sometimes exceeded the configured timeout, continuing to cause Worker Timeouts.
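What that change might look like in a gunicorn.conf.py - a minimal sketch, assuming the gevent worker class (which requires the gevent package to be installed) is a reasonable fit for these I/O-bound calls:

# gunicorn.conf.py - sketch only; the right worker class depends on the app
worker_class = "gevent"    # asynchronous workers instead of the default "sync"
worker_connections = 1000  # concurrent connections each async worker may handle
workers = 1                # still one worker process per container
timeout = 30               # seconds of silence before the arbiter kills a worker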
We’re working with our customers to move those tasks to asynchronous workers and to make sure their logs contain the process ID of each worker, so we can more easily correlate worker timeouts with the code that timed out. We’re also suggesting an increase to two workers per pod to reduce the risk of a single pod failing, with the important caveat that we also double the resources available so we don’t run into the same issues we had before.
Including the process ID in your logs is simple to do - for example:
import logging

# %(process)d adds the worker's PID to every log line.
logging.basicConfig(format='%(asctime)s - %(process)d - %(levelname)s - %(message)s')
The Conclusion
Container orchestrators like Kubernetes already effectively function as process managers. Using an additional process manager inside a given container is redundant and, as the experiment shared above shows, muddies the picture of resource allocation inside your cluster. The advice provided by gunicorn is good for a virtual machine or bare metal, but when running in containers, use one or two workers per container and increase the number of containers and the size of their resource allocations to scale with your workload.
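In practice that boils down to something like the following - a container-oriented sketch, with horizontal scaling left to the orchestrator’s replica count rather than to gunicorn:

# gunicorn.conf.py - container-oriented sketch; tune the values to your workload
bind = "0.0.0.0:8000"
workers = 2        # one or two per container; scale out by adding containers
timeout = 30
accesslog = "-"    # send access and error logs to stdout/stderr so the
errorlog = "-"     # orchestrator's log collection picks them up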
Gunicorn was not designed to run in containers this way, but we may be able to work around that with the use of a good proxy like nginx in front of it.