Fastest cache backend possible for Django

Friday, Apr 7, 2017
11 comments Python, Linux, Web development

tl;dr; Redis is twice as fast as memcached as a Django cache backend when installed using AWS ElastiCache. Only tested for reads.

Django has a wonderful caching framework. I think I say "wonderful" because it's so simple. Not because it has a hundred different bells or whistles. Each cache gets a name (e.g. "mymemcache" or "redis append only"). The only configuration you generally have to worry about is 1) what backend and 2) what location.

For example, to set up a memcached backend:


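# this in settings.py
# config() is assumed to be an environment-reading helper,
# e.g. `from decouple import config` (python-decouple)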
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'KEY_PREFIX': 'myapp',
        'LOCATION': config('MEMCACHED_LOCATION', '127.0.0.1:11211'),
    },
}

With that in play you can now do:


>>> from django.core.cache import caches
>>> caches['default'].set('key', 'value', 60)  # 60 seconds
>>> caches['default'].get('key')
'value'

Django comes with a built-in backend called django.core.cache.backends.locmem.LocMemCache, which is basically a simple Python object in memory with no persistence between Python processes. This one is of course super fast because it involves no network (local or remote) beyond the process itself. But it's not really useful if you care about performance (which you probably do if you're here because of the blog post title) because it can't be shared amongst processes.

Anyway, the most common backends to use are:

Memcached
Redis

These are semi-persistent and built for extremely fast key lookups. They can both be reached over TCP or via a socket.
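As a rough sketch, three named caches like the ones compared in this post could be configured as below. The Redis entry assumes the third-party django-redis package is installed, and the names and locations are placeholders for illustration rather than the exact settings used in the experiment.

```python
# settings.py (sketch): one entry per named cache.
# Locations are placeholder assumptions; point them at your own servers.
CACHES = {
    'default': {
        # In-process memory, not shared between worker processes.
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    },
    'memcached': {
        # python-memcached based backend, as used in this post.
        # Newer Django versions ship PyMemcacheCache instead.
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    },
    'redis': {
        # Requires the third-party django-redis package.
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/0',
    },
}

# Names that a "pick a random backend" view can choose from.
CACHE_NAMES = list(CACHES)
```

Each name is then available as caches['memcached'], caches['redis'] and so on.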

What I wanted to see is which one is fastest.

The Experiment

First of all, in this blog post I'm only measuring the read times of the various cache backends.

Here's the Django view function that is the experiment:


import random
import time

from django import http
from django.conf import settings
from django.core.cache import caches

def run(request, cache_name):
    if cache_name == 'random':
        cache_name = random.choice(settings.CACHE_NAMES)
    cache = caches[cache_name]
    t0 = time.time()
    data = cache.get('benchmarking', [])
    t1 = time.time()
    if random.random() < settings.WRITE_CHANCE:
        data.append(t1 - t0)
        cache.set('benchmarking', data, 60)
    if data:
        avg = 1000 * sum(data) / len(data)
    else:
        avg = 'notyet'
    # print(cache_name, '#', len(data), 'avg:', avg, ' size:', len(str(data)))
    return http.HttpResponse('{}\n'.format(avg))

It records the time to make a cache.get read and, depending on settings.WRITE_CHANCE, it also does a write (but doesn't record that).
What it records is a list of floats. The content of that piece of data stored in the cache looks something like this:

[0.0007331371307373047]
[0.0007331371307373047, 0.0002570152282714844]
[0.0007331371307373047, 0.0002570152282714844, 0.0002200603485107422]

So the data grows from being really small to something really large. If you run this 1,000 times with settings.WRITE_CHANCE of 1.0 the last time it has to fetch a list of 999 floats out of the cache backend.
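To see that growth without a cache server, here is a minimal simulation of the same read-then-maybe-write loop. DictCache is a hypothetical stand-in, not a Django class; it pickles values on the assumption that a real backend serializes them in a similar way.

```python
import pickle
import random
import time


class DictCache:
    """Hypothetical stand-in for a Django cache; values are stored pickled."""

    def __init__(self):
        self._store = {}

    def get(self, key, default=None):
        raw = self._store.get(key)
        return default if raw is None else pickle.loads(raw)

    def set(self, key, value, timeout=None):
        # timeout is accepted to mirror the cache API but ignored in this sketch
        self._store[key] = pickle.dumps(value)

    def size_of(self, key):
        return len(self._store.get(key, b''))


def run_once(cache, write_chance):
    """One simulated request: time the read, then maybe append and write back."""
    t0 = time.time()
    data = cache.get('benchmarking', [])
    t1 = time.time()
    fetched = len(data)
    if random.random() < write_chance:
        data.append(t1 - t0)
        cache.set('benchmarking', data, 60)
    return fetched


cache = DictCache()
fetched_lengths = [run_once(cache, write_chance=1.0) for _ in range(1000)]
print('last read fetched', fetched_lengths[-1], 'floats')  # 999
print('stored payload is', cache.size_of('benchmarking'), 'bytes pickled')
```

With a write chance below 1.0 the list grows more slowly, so the reads being timed stay smaller for longer.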

You can test it with one specific backend in mind and see how fast Django can do, say, 10,000 of these. Here are some such examples:

$ wrk -t10 -c400 -d10s http://127.0.0.1:8000/default
Running 10s test @ http://127.0.0.1:8000/default
  10 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    76.28ms  155.26ms   1.41s    92.70%
    Req/Sec   349.92    193.36     1.51k    79.30%
  34107 requests in 10.10s, 2.56MB read
  Socket errors: connect 0, read 0, write 0, timeout 59
Requests/sec:   3378.26
Transfer/sec:    259.78KB

$ wrk -t10 -c400 -d10s http://127.0.0.1:8000/memcached
Running 10s test @ http://127.0.0.1:8000/memcached
  10 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    96.87ms  183.16ms   1.81s    95.10%
    Req/Sec   213.42     82.47     0.91k    76.08%
  21315 requests in 10.09s, 1.57MB read
  Socket errors: connect 0, read 0, write 0, timeout 32
Requests/sec:   2111.68
Transfer/sec:    159.27KB

$ wrk -t10 -c400 -d10s http://127.0.0.1:8000/redis
Running 10s test @ http://127.0.0.1:8000/redis
  10 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    84.93ms  148.62ms   1.66s    92.20%
    Req/Sec   262.96    138.72     1.10k    81.20%
  25271 requests in 10.09s, 1.87MB read
  Socket errors: connect 0, read 0, write 0, timeout 15
Requests/sec:   2503.55
Transfer/sec:    189.96KB

But an immediate disadvantage with this is that the "total final rate" (i.e. requests/sec) is likely to be affected by many other factors. However, you can see that LocMemCache got 3378.26 req/s, MemcachedCache got 2111.68 req/s and RedisCache got 2503.55 req/s.

The code for the experiment is available here: https://github.com/peterbe/django-fastest-cache

The Infra Setup

I created an AWS m3.xlarge EC2 Ubuntu node and two clusters in AWS ElastiCache: one 2-node memcached cluster based on cache.m3.xlarge and one 2-node 1-replica Redis cluster also based on cache.m3.xlarge.

The Django server was run with uWSGI like this:

uwsgi --http :8000 --wsgi-file fastestcache/wsgi.py  --master --processes 6 --threads 10

The Results

Instead of hitting one backend repeatedly and reporting the "requests per second", I hit the "random" endpoint for 30 seconds and let it randomly select a cache backend each time. Once that's done, I read each cache and look at the final massive list of timings it took to make all the reads. I run it like this:

wrk -t10 -c400 -d30s http://127.0.0.1:8000/random && curl http://127.0.0.1:8000/summary
...wrk output redacted...

                         TIMES        AVERAGE         MEDIAN         STDDEV
memcached                 5738        7.523ms        4.828ms        8.195ms
default                   3362        0.305ms        0.187ms        1.204ms
redis                     4958        3.502ms        1.707ms        5.591ms

Best Averages (shorter better)
###############################################################################
█████████████████████████████████████████████████████████████  7.523  memcached
██                                                             0.305  default
████████████████████████████                                   3.502  redis
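A summary like the one above can be computed from each backend's list of timings with the standard library. This is only a sketch: the function names are hypothetical, the sample timings are made up, and whether the original used the sample or population standard deviation is an assumption.

```python
import statistics


def summarize(timings):
    """Return count plus average, median and standard deviation in milliseconds."""
    ms = [t * 1000 for t in timings]
    return {
        'times': len(ms),
        'average': statistics.mean(ms),
        'median': statistics.median(ms),
        # Sample standard deviation; an assumption, the original may differ.
        'stddev': statistics.stdev(ms) if len(ms) > 1 else 0.0,
    }


def bar(value, largest, width=60):
    """Scale value against the largest value into a bar of block characters."""
    return '█' * max(1, round(width * value / largest))


# Made-up timings in seconds, only to exercise the functions.
samples = {
    'memcached': [0.004, 0.005, 0.012],
    'redis': [0.001, 0.002, 0.006],
}
rows = {name: summarize(values) for name, values in samples.items()}
largest = max(row['average'] for row in rows.values())

for name, row in rows.items():
    print('{:<10} {:>6} {:>10.3f}ms {:>10.3f}ms {:>10.3f}ms'.format(
        name, row['times'], row['average'], row['median'], row['stddev']))
for name, row in rows.items():
    print('{} {:.3f} {}'.format(bar(row['average'], largest), row['average'], name))
```

A few slow outliers pull the average well above the median, which is the same pattern the real results show.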

Things to note:

Redis is twice as fast as memcached.
Pure Python LocMemCache is roughly 10 times faster than Redis.
The table reports average and median. The ASCII bar chart shows only the averages.
All three backends report huge standard deviations. The median is very different from the average.
The average is probably the more interesting number since it more reflects the ups and downs of reality.
If you compare the medians, Redis is 3 times faster than memcached.
It's luck that Redis got fewer datapoints than memcached (4958 vs 5738) but it's as expected that the LocMemCache backend only gets 3362 because the uWSGI server that is used is spread across multiple processes.
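The ratios quoted in these notes can be checked directly against the table. A small worked calculation, using only the figures reported above:

```python
# Figures copied from the results table above, in milliseconds.
results = {
    'memcached': {'average': 7.523, 'median': 4.828},
    'default': {'average': 0.305, 'median': 0.187},
    'redis': {'average': 3.502, 'median': 1.707},
}


def speedup(slower, faster, stat):
    """How many times faster `faster` is than `slower` for the given statistic."""
    return results[slower][stat] / results[faster][stat]


print('redis vs memcached (average): {:.1f}x'.format(speedup('memcached', 'redis', 'average')))  # 2.1x
print('redis vs memcached (median):  {:.1f}x'.format(speedup('memcached', 'redis', 'median')))   # 2.8x
print('locmem vs redis (average):    {:.1f}x'.format(speedup('redis', 'default', 'average')))    # 11.5x
```

So "twice", "3 times" and "10 times" are rounded versions of about 2.1x, 2.8x and 11.5x.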

Other Things To Test

Perhaps pylibmc is faster than python-memcached.

                         TIMES        AVERAGE         MEDIAN         STDDEV
pylibmc                   2893        8.803ms        6.080ms        7.844ms
default                   3456        0.315ms        0.181ms        1.656ms
redis                     4754        3.697ms        1.786ms        5.784ms

Best Averages (shorter better)
###############################################################################
██████████████████████████████████████████████████████████████   8.803  pylibmc
██                                                               0.315  default
██████████████████████████                                       3.697  redis

Using pylibmc didn't make things much faster. What if we pit memcached against pylibmc?

                         TIMES        AVERAGE         MEDIAN         STDDEV
pylibmc                   3005        8.653ms        5.734ms        8.339ms
memcached                 2868        8.465ms        5.367ms        9.065ms

Best Averages (shorter better)
###############################################################################
█████████████████████████████████████████████████████████████  8.653  pylibmc
███████████████████████████████████████████████████████████    8.465  memcached

What about that fancy hiredis Redis Python driver that's supposedly faster?

                         TIMES        AVERAGE         MEDIAN         STDDEV
redis                     4074        5.628ms        2.262ms        8.300ms
hiredis                   4057        5.566ms        2.296ms        8.471ms

Best Averages (shorter better)
###############################################################################
███████████████████████████████████████████████████████████████  5.628  redis
██████████████████████████████████████████████████████████████   5.566  hiredis

These last two results are both surprising and suspicious. Perhaps the whole setup is wrong. Why wouldn't the C-based libraries be faster? Is it so incredibly dwarfed by the network I/O in the time between my EC2 node and the ElastiCache nodes?
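One way to get a feel for the network question is to time the client-side work on its own. The sketch below only measures unpickling a 999-float payload, under the assumption that values are stored pickled. It does not measure the protocol parsing that a C driver would speed up, so treat it as a rough lower bound on client CPU cost and not as an explanation of the results.

```python
import pickle
import timeit

# 999 distinct floats, the largest list the experiment reads when every request writes.
payload = pickle.dumps([i * 1e-6 for i in range(1, 1000)])

runs = 2000
seconds = timeit.timeit(lambda: pickle.loads(payload), number=runs)
per_call_ms = 1000 * seconds / runs

# 3.502 ms is the average Redis read time from the ElastiCache results above.
share = per_call_ms / 3.502
print('unpickling: {:.4f} ms per call, {:.1%} of a 3.502 ms read'.format(per_call_ms, share))
```

If that share turns out small on your machine, most of each read is spent somewhere other than deserializing the value, which would fit the network I/O hunch.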

In Conclusion

I personally like Redis. It's not as stable as memcached. On a personal server I've run for years the Redis server sometimes just dies due to corrupt memory and I've come to accept that. I don't think I've ever seen memcache do that.

But there are other benefits with Redis as a cache backend. With the django-redis library you have really easy access to the raw Redis connection and you can do much more advanced data structures. You can also cache certain things indefinitely. Redis also supports storing much larger strings than memcached (1MB for memcached and 512MB for Redis).

The conclusion is that Redis is faster than memcached by a factor of 2. Considering the other feature benefits you can get out of having a Redis server available, it's probably a good choice for your next Django project.

Bonus Feature

In big setups you most likely have a whole slew of web heads, that is, servers that do nothing but handle web requests. And these are configured to talk to databases and caches over the near network. However, many of us have cheap servers on DigitalOcean or Linode where we run web servers, relational databases and cache servers all on the same machine. (I do. This blog is one of those where there is Nginx, Redis, memcached and PostgreSQL on a 4GB DigitalOcean SSD Ubuntu).

So here's one last test where I installed a local Redis and a local memcached on the EC2 node itself:

$ cat .env | grep 127.0.0.1
MEMCACHED_LOCATION="127.0.0.1:11211"
REDIS_LOCATION="redis://127.0.0.1:6379/0"

Here are the results:

                         TIMES        AVERAGE         MEDIAN         STDDEV
memcached                 7366        3.456ms        1.380ms        5.678ms
default                   3716        0.263ms        0.189ms        1.002ms
redis                     5582        2.334ms        0.639ms        4.965ms

Best Averages (shorter better)
###############################################################################
█████████████████████████████████████████████████████████████  3.456  memcached
████                                                           0.263  default
█████████████████████████████████████████                      2.334  redis

The conclusion of that last benchmark is that Redis is still faster and it's roughly 1.8x faster to run these backends on the web head than to use ElastiCache. Perhaps that just goes to show how amazingly fast the AWS inter-datacenter fiber network is!
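The post doesn't spell out how the 1.8x figure was derived. One guess that lands close is to average the per-backend ratios of the ElastiCache averages to the local averages:

```python
# Averages in milliseconds, copied from the two result tables in this post.
elasticache = {'memcached': 7.523, 'redis': 3.502}
local = {'memcached': 3.456, 'redis': 2.334}

ratios = {name: elasticache[name] / local[name] for name in local}
for name, ratio in ratios.items():
    print('{}: {:.2f}x faster locally'.format(name, ratio))

mean_ratio = sum(ratios.values()) / len(ratios)
print('mean of the two ratios: {:.2f}x'.format(mean_ratio))  # 1.84x
```

Note that the two backends differ quite a bit here: memcached gains about 2.2x from being local while Redis gains about 1.5x.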

Comments

Cezar April 25, 2017

I've run redis with a million users without it crashing for years at a time. Not sure what's up with your install.

Peter Bengtsson April 25, 2017

My problem was that the server ran out of memory. And instead of grinding to a halt, it crashes.

Grant Jenks May 18, 2017

Great to see more benchmarks out there. In my local testing, Memcached often outperforms Redis by a fair margin. Seems like AWS ElastiCache is having a large effect on the results. You may be interested in my DiskCache project: http://www.grantjenks.com/docs/diskcache/ which also benchmarks Django cache backends.

Peter Bengtsson May 19, 2017

How did you measure memcache outperformed Redis?

Also, disk cache seems dangerously inefficient when you have multiple web heads.
Mind you, this whole blog is served from disk. Django renders HTML which is dumped on disk and Nginx serves that, and falls back to Django (via uWSGI).

Grant Jenks May 19, 2017

Similar setup to yours. Initialize Django with multiple caches then test the performance of get/set/delete operations. Script in Github repo at tests/benchmark_djangocache.py Results are here: http://www.grantjenks.com/docs/diskcache/djangocache-benchmarks.html I'm measuring multiple percentiles: 50th (median), 90th, 99th, and max.

I'm not sure what "multiple web heads" refers to but I will guess you mean a load balancer with multiple servers behind it. That is a scenario where you've probably outgrown DiskCache. Although if you load balance consistently (say by ip-hash) then the issue can be mitigated.

axe leo February 7, 2019

good stuff, i will try out redis and switch out memcached... wanted to do this for a while

Manu Lauria March 10, 2019

I was reading an article on why going over the network/socket is so much slower than doing it natively in Python (https://www.prowesscorp.com/computer-latency-at-a-human-scale/), and was wondering how locMem does it.

I am a newbie, and had just started searching, and almost immediately hit this article. This experiment confirms that the locMem cache is in fact in the process's own memory and not gotten through a socket. The socket, howsoever light it is, will presumably always be outperformed by local access. And then, one has to 'protect' port 11211 from infiltrators (https://www.digitalocean.com/community/tutorials/how-to-secure-memcached-by-reducing-exposure) - not difficult, but still ...

The locmem cache's disadvantage is that its data is not shared across processes, and is not persistent (your process goes down, the cache gets busted). To me, given how complicated cache busting is, this looks like an advantage. Cache busting becomes as easy as shutting things down, and bringing them up again.

To 'fill' the cache after a shutdown automatically, I thought of two approaches -
1. Run "ab" with N parallel connections for a few seconds with all the target endpoints, where N is larger than the number of workers (in my case gunicorn sockets), forcing each to fill up their caches.
2. Have a special end-point, that just fills up the caches. I implemented the first, being lazy and not wanting to do the extra coding.

So each time I 'goDown' and 'bringUp' my code base, I just do a 'fillUp' using a script which calls ab. I do have dynamic pages, but large parts of each dynamic page is in fact static, so I cache fragments for repeat users, and full pages for first time visitors to each endpoint (those that come without a cookie).

I am worried I am doing it all wrong! Because the other disadvantage of locMem is that each process will use up memory for its own cache. In my case, the caches are not big. If they do get bigger, I will try a shared memory approach, since that would still be much faster than sockets (http://pages.cs.wisc.edu/~adityav/Evaluation_of_Inter_Process_Communication_Mechanisms.pdf).

(This is my first project with Django, and I am using nginx, postgres, gunicorn, python, django, locMem, on a single CPU droplet on DOcean, with 3 gunicorn workers, static/media files being served from that droplet itself rather than AWS).

Peter Bengtsson March 11, 2019

If you have a single CPU droplet, doesn't that mean you only have 1 CPU and thus there's no point using 3 gunicorn workers, because two of them don't have distinct CPUs to use?

And it's true. If you do have, say, 4 CPUs and 4 gunicorn workers, suppose that you fill all your caches, you're going to need to use 4x as much RAM memory compared to letting all be handled by 1 memcache or redis.

Anonymous March 11, 2019

the 3 gunicorn workers is as per the recommendation from gunicorn documentation (2*num_processors + 1).
http://docs.gunicorn.org/en/stable/design.html
My cache is small - max 500 items (static fragments of pages) X max 200kb = 100Mb. With 4Gb RAM, I thought I could afford the duplication.

Michael Herman June 4, 2019

Any thoughts on using `LocMemCache` with the Gunicorn preload (http://docs.gunicorn.org/en/stable/settings.html#preload-app) option to share a section of memory across workers? When would this be appropriate to use?

Peter Bengtsson June 4, 2019

Or you could just use fewer workers and a bunch of threads. Then you could code with global mutables without worrying about thread-safety. All in all, sounds scary but there might be cases where there are performance benefits.

The obvious drawbacks are that you won't be able to reach that cache data from somewhere else, e.g. `./manage.py post-process-cached-things`, and as soon as you destroy the gunicorn worker all that memory is lost and needs to be rebuilt. If you just stuff it in a db like Redis, then you can destroy and create new web workers without any risk of empty caches or stampeding herd problems.
