Filtered by Django

Page 2

Reset

A Python and Preact app deployed on Heroku

December 13, 2019
2 comments Web development, Django, Python, Docker, JavaScript

Heroku is great but it's sometimes painful when your app isn't just in one single language. What I have is a project where the backend is Python (Django) and the frontend is JavaScript (Preact). The folder structure looks like this:

/
  - README.md
  - manage.py
  - requirements.txt
  - my_django_app/
     - settings.py
     - asgi.py
     - api/
        - urls.py
        - views.py
  - frontend/
     - package.json
     - yarn.lock
     - preact.config.js
     - build/
        ...
     - src/
        ...

A bunch of things omitted for brevity but people familiar with Django and preact-cli/create-create-app should be familiar.
The point is that the root is a Python app and the front-end is exclusively inside a sub folder.

When you do local development, you start two servers:

  • ./manage.py runserver - starts http://localhost:8000
  • cd frontend && yarn start - starts http://localhost:3000

The latter is what you open in your browser. That preact app will do things like:


const response = await fetch('/api/search');

and, in preact.config.js I have this:


export default (config, env, helpers) => {

  if (config.devServer) {
    config.devServer.proxy = [
      {
        path: "/api/**",
        target: "http://localhost:8000"
      }
    ];
  }

};

...which is hopefully self-explanatory. So, calls like GET http://localhost:3000/api/search actually goes to http://localhost:8000/api/search.

That's when doing development. The interesting thing is going into production.

Before we get into Heroku, let's first "merge" the two systems into one and the trick used is Whitenoise. Basically, Django's web server will be responsibly not only for things like /api/search but also static assets such as / --> frontend/build/index.html and /bundle.17ae4.js --> frontend/build/bundle.17ae4.js.

This is basically all you need in settings.py to make that happen:


MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",
    ...
]

WHITENOISE_INDEX_FILE = True

STATIC_URL = "/"
STATIC_ROOT = BASE_DIR / "frontend" / "build"

However, this isn't quite enough because the preact app uses preact-router which uses pushState() and other code-splitting magic so you might have a URL, that users see, like this: https://myapp.example.com/that/thing/special and there's nothing about that in any of the Django urls.py files. Nor is there any file called frontend/build/that/thing/special/index.html or something like that.
So for URLs like that, we have to take a gamble on the Django side and basically hope that the preact-router config knows how to deal with it. So, to make that happen with Whitenoise we need to write a custom middleware that looks like this:


from whitenoise.middleware import WhiteNoiseMiddleware


class CustomWhiteNoiseMiddleware(WhiteNoiseMiddleware):
    def process_request(self, request):
        if self.autorefresh:
            static_file = self.find_file(request.path_info)
        else:
            static_file = self.files.get(request.path_info)

            # These two lines is the magic.
            # Basically, the URL didn't lead to a file (e.g. `/manifest.json`)
            # it's either a API path or it's a custom browser path that only
            # makes sense within preact-router. If that's the case, we just don't
            # know but we'll give the client-side preact-router code the benefit
            # of the doubt and let it through.
            if not static_file and not request.path_info.startswith("/api"):
                static_file = self.files.get("/")

        if static_file is not None:
            return self.serve(static_file, request)

And in settings.py this change:


MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
-   "whitenoise.middleware.WhiteNoiseMiddleware",
+   "my_django_app.middleware.CustomWhiteNoiseMiddleware",
    ...
]

Now, all traffic goes through Django. Regular Django view functions, static assets, and everything else fall back to frontend/build/index.html.

Heroku

Heroku tries to make everything so simple for you. You basically, create the app (via the cli or the Heroku web app) and when you're ready you just do git push heroku master. However that won't be enough because there's more to this than Python.

Unfortunately, I didn't take notes of my hair-pulling excruciating journey of trying to add buildpacks and hacks and Procfiles and custom buildpacks. Nothing seemed to work. Perhaps the answer was somewhere in this issue: "Support running an app from a subdirectory" but I just couldn't figure it out. I still find buildpacks confusing when it's beyond Hello World. Also, I didn't want to run Node as a service, I just wanted it as part of the "build process".

Docker to the rescue

Finally I get a chance to try "Deploying with Docker" in Heroku which is a relatively new feature. And the only thing that scared me was that now I need to write a heroku.yml file which was confusing because all I had was a Dockerfile. We'll get back to that in a minute!

So here's how I made a Dockerfile that mixes Python and Node:


FROM node:12 as frontend

COPY . /app
WORKDIR /app
RUN cd frontend && yarn install && yarn build


FROM python:3.8-slim

WORKDIR /app

RUN groupadd --gid 10001 app && useradd -g app --uid 10001 --shell /usr/sbin/nologin app
RUN chown app:app /tmp

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    gcc apt-transport-https python-dev

# Gotta try moving this to poetry instead!
COPY ./requirements.txt /app/requirements.txt
RUN pip install --upgrade --no-cache-dir -r requirements.txt

COPY . /app
COPY --from=frontend /app/frontend/build /app/frontend/build

USER app

ENV PORT=8000
EXPOSE $PORT

CMD uvicorn gitbusy.asgi:application --host 0.0.0.0 --port $PORT

If you're not familiar with it, the critical trick is on the first line where it builds some Node with as frontend. That gives me a thing I can then copy from into the Python image with COPY --from=frontend /app/frontend/build /app/frontend/build.

Now, at the very end, it starts a uvicorn server with all the static .js, index.html, and favicon.ico etc. available to uvicorn which ultimately runs whitenoise.

To run and build:

docker build . -t my_app
docker run -t -i --rm --env-file .env -p 8000:8000 my_app

Now, opening http://localhost:8000/ is a production grade app that mixes Python (runtime) and JavaScript (static).

Heroku + Docker

Heroku says to create a heroku.yml file and that makes sense but what didn't make sense is why I would add cmd line in there when it's already in the Dockerfile. The solution is simple: omit it. Here's what my final heroku.yml file looks like:


build:
  docker:
    web: Dockerfile

Check in the heroku.yml file and git push heroku master and voila, it works!

To see a complete demo of all of this check out https://github.com/peterbe/gitbusy and https://gitbusy.herokuapp.com/

Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON

September 30, 2019
2 comments Redis, Nginx, Web Performance, Python, Django, PostgreSQL

Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.

However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.

But is it entirely wrong to look at WHICH database to get the best speed?

Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx

Nginx??

The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:


const response = await fetch('https://songsear.ch/api/song/5246889/amazon');

Internally, in the app, what it does is that it looks this up, by ID, on the AmazonAffiliateLookup ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as ProductTypeName, Studio, TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.

Again, Nginx?

Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.

Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:

▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes

Summary:
  Total:    0.9882 secs
  Slowest:  0.0279 secs
  Fastest:  0.0001 secs
  Average:  0.0010 secs
  Requests/sec: 10119.8265


Response time histogram:
  0.000 [1] |
  0.003 [9752]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [108]   |
  0.008 [70]    |
  0.011 [32]    |
  0.014 [8] |
  0.017 [12]    |
  0.020 [11]    |
  0.022 [1] |
  0.025 [4] |
  0.028 [1] |


Latency distribution:
  10% in 0.0003 secs
  25% in 0.0006 secs
  50% in 0.0008 secs
  75% in 0.0010 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0068 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0001 secs, 0.0279 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0026 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:    0.0008 secs, 0.0001 secs, 0.0206 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200] 10000 responses

10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a Cache-Control HTTP header with something like max-age=86400.

Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).

Conclusion

If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.

SongSearch autocomplete rate now 2+ per second

July 11, 2019
0 comments Django, Python, Nginx, Redis

By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....

Roughly, 1.8 requests per second goes back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".

Autocomplete counts in Datadog

Caching strategy

Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.

What I used to do was something like this:


def search_autocomplete(request):
   q = request.GET.get('q') 

   cache_key = None
   if len(q) < 10:
      cache_key = 'autocomplete:' + q
      results = cache.get(cache_key)
      if results is not None:
          return http.JsonResponse(results)

   results = _do_elastisearch_query(q)
   if cache_key:
       cache.set(cache_key, results, 60 * 60)

   return http.JsonResponse(results)   

However, after some simple benchmarking it was clear that using Nginx' uwsgi_cache it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:


def search_autocomplete(request):
   q = request.GET.get('q') 
   results = _do_elastisearch_query(q)
   response = http.JsonResponse(results)   

   if len(q) < 10:
       patch_cache_control(response, public=True, max_age=60 * 60)

   return response

The only annoying thing about Nginx caching is that purging is hard unless you go for that Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is that fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...

> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...

Nginx Amplify

My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:

NGINX Requests/s

NGINX Memory Usage

NGINX CPU Usage %

Thoughts and conclusion

Caching is hard. But it's also fun because it ties directly into performance work.

In my business logic, I chose that autocomplete queries that are between 1 and 9 characters are cacheable. And I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes in RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB which isn't much.

Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python I was able to decide about the cache key myself. A plausible implementation is cache_key = q.lower().strip() for example. That means you can protect your Elasticsearch backend from having to do {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.

The ideal would be some user-friendly diagnostics tool that I can point somewhere, towards Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.

UPDATE (July 12 2019)

Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps it up) for queries that at max 9 characters long. I wanted to appreciate and understand how ratio of all queries are longer than 9 characters so I wrote a report and its output is this:

POINT: 7
Sum show 75646 32.2%
Sum rest 159321 67.8%

POINT: 8
Sum show 83702 35.6%
Sum rest 151265 64.4%

POINT: 9
Sum show 90870 38.7%
Sum rest 144097 61.3%

POINT: 10
Sum show 98384 41.9%
Sum rest 136583 58.1%

POINT: 11
Sum show 106093 45.2%
Sum rest 128874 54.8%

POINT: 12
Sum show 113905 48.5%
Sum rest 121062 51.5%

It means that (independent of time expiry) 38.7% of queries are 9 characters or less.

Build an XML sitemap of XML sitemaps

June 1, 2019
0 comments Django, Python

Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.

The blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files which spans about 1,000,000 URLs. The context here is Python and the getting of the data is from Django. Python is pretty key here but if you have something other than Django, you can squint and mentally replace that with your own data mapper.

Generate the sitemap .xml.gz file(s)

Here's the core of the work. A generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.


import gzip

from lxml import etree


outfile = "sitemap-{start}-{end}.xml"
batchsize = 40_000


def generate(self, qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None

    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    close(root, filename)
    yield filename

The most important lines in terms of lxml.etree and sitemaps are:


root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...         
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc

Another important thing is the note about using .values(). If you don't do that Django will create a model instance for every single row it returns of the iterator. That's expensive. See this blog post.

Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.

Generate the map of sitemaps

Making the map of maps doesn't need to be gzipped since it's going to be tiny.


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))

And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.

Bonus and Thoughts

The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging fruit fixes. I know, because I tried. I experimented with not using pretty_print=True and I experimented with not writing with gzip.open and instead gzipping the files on later. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is order of maginitude marginal in comparison to the cost of actually getting the data out of the database plus later writing to disk. I also experimenting with generating the gzip content with zopfli and it didn't make much of a difference.

I originally wrote this code years ago and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000 so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000 but perhaps not important.

Generate a random IP address in Python

June 1, 2019
0 comments Django, Python

I have a commenting system where people can type in a comment and optionally their name and email if they like.
In production, where things are real, the IP address that can be collected are all interestingly different. But when testing this manually on my laptop, since the server is running http://localhost:8000, the request.META.get('REMOTE_ADDR') always becomes 127.0.0.1. Boring! So I fake it. Like this:


import random
from ipaddress import IPv4Address


def _random_ip_address(seed):
    random.seed(seed)
    return str(IPv4Address(random.getrandbits(32)))


...
# Here's the code deep inside the POST handler just before storing 
# the form submission the database.

if settings.DEBUG and metadata.get("REMOTE_ADDR") == "127.0.0.1":
    # Make up a random one!
    metadata["REMOTE_ADDR"] = _random_ip_address(
        str(form.cleaned_data["name"]) + str(form.cleaned_data["email"])
    )

It's pretty rough but it works and makes me happy.

Best way to count distinct indexed things in PostgreSQL

March 21, 2019
3 comments Django, PostgreSQL

tl;dr; SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;

I have a table that looks like this:

songsearch=# \d main_songtexthash
            Table "public.main_songtexthash"
  Column   |           Type           | Collation | Nullable |
-----------+--------------------------+-----------+----------+
 id        | integer                  |           | not null |
 text_hash | character varying(32)    |           | not null |
 created   | timestamp with time zone |           | not null |
 modified  | timestamp with time zone |           | not null |
 song_id   | integer                  |           | not null |
Indexes:
    "main_songtexthash_pkey" PRIMARY KEY, btree (id)
    "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id)
    "main_songtexthash_text_hash_c2771f1f" btree (text_hash)
    "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops)
Foreign-key constraints:
    ...snip...

And the data looks something like this:

songsearch=# select text_hash, song_id from main_songtexthash limit 10;
            text_hash             | song_id
----------------------------------+---------
 6f98e1945e64353bead9d6ab47a7f176 | 2565031
 0c6662363aa4a340fea5efa24c98db76 |  486091
 a25af539b183cbc338409c7acecc6828 |     212
 5aaf561b38c251e7d863aae61fe1363f | 2141077
 6a221df60f7cbb8a4e604f87c9e3aec0 |  245186
 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382
 95c395dd78679120269518b19187ca80 |  981402
 8ab19b32b3be2d592aa69e4417b732cd |  616848
 8ab19b32b3be2d592aa69e4417b732cd |  243393
 01568f1f57aeb7a97e2544978fc93b4c |     333
(10 rows)

If you look carefully, you'll notice that every song_id has a different text_hash except two of them.
Song IDs 616848 and 243393 both have the same text_hash of value 8ab19b32b3be2d592aa69e4417b732cd.

Also, if you imagine this table only has 10 rows, you could quickly and easily conclude that there are 10 different song_id but 9 different distinct text_hash. However, how do you do this counting if the tables are large??

The Wrong Way

songsearch=# select count(distinct text_hash) from main_songtexthash;
  count
---------
 1825983
(1 row)

And the explanation and cost analysis is:

songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1)
   ->  Seq Scan on main_songtexthash  (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1)
 Planning Time: 0.059 ms
 Execution Time: 40029.250 ms
(4 rows)

Oh noes! A Sec Scan! Run!

The Right Way

Better explained in this blog post but basically, cutting to the chase, here's how you count on an indexed field:

songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t;
  count
---------
 1825983
(1 row)

And the explanation and cost analysis is:

songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t;
                                                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1)
   ->  Unique  (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1)
         ->  Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash  (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1)
               Heap Fetches: 1879521
 Planning Time: 0.082 ms
 Execution Time: 4894.581 ms
(6 rows)

Same exact result but ~5s instead of ~40s. I'll take that, thank you very much.

The Django Way

As a bonus: Django is smart. Here's how they do it:


>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983

And, the SQL it generates to make that count looks very familiar:


SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery

Conclusion

  • Avoid "sequential scans" like the plague if you care about performance (...or not just killing your resources).
  • Trust in Django.