
Gunicorn in Containers

There is a dearth of information on how to run a Flask/Django app on Kubernetes using gunicorn (mostly for good reason!), and what information is available is often conflicting and confusing. Based on issues I’ve seen with my customers at Defiance Digital over the last year or so, I put together a test repository to experiment with different configurations and see which works best.

tl;dr The conventional wisdom of using multiple workers in a containerized instance of Flask/Django/anything served with gunicorn is wrong - you should run only one worker per container, otherwise you’re not properly using the resources allocated to your application. Using multiple workers per container also risks OOM SIGKILLs with no log output, making diagnosis of issues much more difficult than it needs to be.
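
As a concrete illustration, a minimal gunicorn.conf.py for a containerized deployment might look like the sketch below. This is not the configuration from either customer; the port, thread count, and timeout values are placeholders to show the shape of the idea.

    # gunicorn.conf.py - minimal sketch for one worker per container.
    # Scaling happens by adding container replicas, not by adding workers.
    bind = "0.0.0.0:8000"
    workers = 1    # single worker; the orchestrator handles parallelism
    threads = 4    # optional: threads help with I/O-bound request handling
    timeout = 30   # seconds before gunicorn kills a hung worker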

The Issues

We currently have three customers using Django in containers, two of which run those containers in Kubernetes.

Two customers running Django in Kubernetes? Weird that it happened twice...

The two customers in Kubernetes would frequently run into gunicorn Worker Timeout errors, causing their applications to block and fail regularly. Their developers spent a bit of time investigating but, as often happens with this issue, the logs had no helpful information. They came to us to help discover what the problem was.

Each customer was following the documented gunicorn suggestion for the number of workers - (2 x $num_cores) + 1. This is a sensible place to start, especially considering the docs claim that you only need 4-12 workers to handle hundreds or thousands of requests per second. Our customers are startups and were receiving a maximum of 400-500 requests per second, so in theory even a single worker should be able to meet their traffic needs.
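
That formula is commonly applied in a gunicorn config along the lines of the sketch below (a typical pattern, not necessarily what either customer had). Worth noting: inside a container, cpu_count() usually reports the host’s cores rather than the container’s CPU limit, which can inflate the worker count even further.

    # gunicorn.conf.py - the docs' (2 x cores) + 1 formula.
    # In a container, cpu_count() typically reflects the host's cores,
    # not the container's CPU limit, so this can overshoot badly.
    import multiprocessing

    workers = (2 * multiprocessing.cpu_count()) + 1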

The gunicorn FAQ has a section on why workers are silently killed. Here, it suggests that when you have workers that are killed without logs, it’s due to the OS-level OOM killer. At first we dismissed this as the root cause because our monitoring showed we almost never used more than 50% of the available memory in a given container. However, upon further investigation, we couldn’t find any other reason for workers silently dying.

The Results

Armed with the working theory that memory utilization was causing the problems, I formed a further hypothesis: the worker processes were splitting the available resources in the container between them. This would explain why we saw OOM kills even though utilization never rose above 50% - if a single worker hit its share’s ceiling during a spike, that was enough to trigger an OOM kill without the container’s full resources ever being used.
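
Here is a back-of-the-envelope illustration of that theory. The numbers are made up for the example, not taken from our customers’ actual limits or usage.

    # Hypothetical numbers to illustrate the "split resources" theory.
    container_limit_mib = 1024                                   # container memory limit
    num_workers = 4
    per_worker_share_mib = container_limit_mib / num_workers     # 256 MiB each

    baseline_per_worker_mib = 110                                # steady-state usage
    container_usage_mib = num_workers * baseline_per_worker_mib  # ~440 MiB, ~43%

    # Under the splitting theory, one worker spiking past its ~256 MiB share
    # can be killed even though the container as a whole never approaches
    # its 1024 MiB limit in the monitoring graphs.
    spike_mib = 300
    worker_killed = spike_mib > per_worker_share_mib             # True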

Using a Docker Desktop Kubernetes cluster on my work MacBook, I wrote a repeatable test and shared it as an open source repository: https://github.com/Defiance-Digital/gunicorn-k8s-test. The test confirmed my theory - each worker split the container’s resources evenly among the group!

We made the change to allow only a single worker per container, since the container itself already serves as a discrete worker, and immediately saw the container’s resource utilization increase. We also noted that the spikes causing worker OOM kills did, in fact, reach the expected levels in our monitoring tool. As a side benefit, we were able to decrease the number of containers we were running while increasing the performance of each one, leading to a net cost savings since we no longer needed as many host nodes.

Ultimately, while this change decreased the number of Worker Timeout errors in our customer applications, it did not solve the problem entirely. The applications were also making inefficient SQL queries that took longer than the configured timeout, which caused their own silent failures. After solving both issues, the number of Worker Timeout errors in these applications dropped to zero.
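
To make that second failure mode concrete: gunicorn kills any worker whose request runs longer than its timeout setting, so a view executing a slow SQL query surfaces as a Worker Timeout rather than an application error. The value below is gunicorn’s default, not necessarily what these applications had configured.

    # gunicorn.conf.py - the setting behind the second failure mode.
    # A request (e.g. a view running a slow SQL query) that takes longer
    # than this many seconds causes gunicorn to kill the handling worker,
    # which shows up as a Worker Timeout error.
    timeout = 30   # gunicorn's default; tune alongside fixing the slow queries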

The Conclusion

Container orchestrators like Kubernetes already function as process managers. Running an additional process manager inside a given container is redundant and, as the experiment above shows, muddies resource allocation inside your cluster. Gunicorn’s advice is good for a virtual machine or bare-metal host, but when running in containers, use one worker per container and scale by increasing the number of containers.