How to have default/initial values in a Django form that is bound and rendered

January 10, 2020
11 comments Web development, Django, Python

Django's Form framework is excellent. It's intuitive and versatile and, best of all, easy to use. However, one little thing that is not so intuitive is how do you render a bound form with default/initial values when the form is never rendered unbound.

If you do this in Django:


class MyForm(forms.Form):
    name = forms.CharField(required=False)

def view(request):
    form = MyForm(initial={'name': 'Peter'})
    return render(request, 'page.html', form=form)

# Imagine, in 'page.html' that it does this:
#  <label>Name:</label>
#  {{ form.name }}

...it will render out this:


<label>Name:</label>
<input type="text" name="name" value="Peter">

The whole initial trick is something you can set on the whole form or individual fields. But it's only used in UN-bound forms when rendered.

If you change your view function to this:


def view(request):
    form = MyForm(request.GET, initial={'name': 'Peter'}) # data passed!
    if form.is_valid():  # makes it bound!
        print(form.cleaned_data['name'])
    return render(request, 'page.html', form=form)

Now, the form is bound and the initial stuff is essentially ignored.
Because name is not present in request.GET. And if it was present, but an empty string, it wouldn't be able to benefit for the default value.

My solution

I tried many suggestions and tricks (based on rapid Stackoverflow searching) and nothing worked.

I knew one thing: Only the view should know the actual initial values.

Here's what works:


import copy


class MyForm(forms.Form):
    name = forms.CharField(required=False)

    def __init__(self, data, **kwargs):
        initial = kwargs.get('initial', {})
        data = {**initial, **data}
        super().__init__(data, **kwargs)

Now, suppose you don't have ?name=something in request.GET the line print(form.cleaned_data['name']) will print Peter and the rendered form will look like this:


<label>Name:</label>
<input type="text" name="name" value="Peter">

And, as expected, if you have ?name=Ashley in request.GET it will print Ashley and produce this rendered HTML too:


<label>Name:</label>
<input type="text" name="name" value="Ashley">

UPDATE June 2020

If data is a QueryDict object (e.g. <QueryDict: {'days': ['90']}>), and initial is a plain dict (e.g. {'days': 30}),
then you can merge these with {**data, **initial} because it produces a plain dict of value {'days': [90]} which Django's form stuff doesn't know is supposed to be "flattened".

The solution is to use:


from django.utils.datastructures import MultiValueDict

...

    def __init__(self, data, **kwargs):
        initial = kwargs.get("initial", {})
        data = MultiValueDict({**{k: [v] for k, v in initial.items()}, **data})
        super().__init__(data, **kwargs)

(To be honest; this might work in the app I'm currently working on but I don't feel confident that this is covering all cases)

How to split a block of HTML with Cheerio in NodeJS

January 3, 2020
2 comments Node, JavaScript

cheerio is a great Node library for processing HTML. It's faster than JSDOM and years and years of jQuery usage makes the API feel yummily familiar.

What if you have a piece of HTML that you want to split up into multiple blocks? For example, you have this:


<div>Prelude</div>

<h2>First Header</h2>

<p>Paragraph <b>here</b>.</p>
<p>Another paragraph.</p>

<h2 id="second">Second Header</h2>

<ul>
  <li>One</li>
  <li>Two</li>
</ul>
<blockquote>End quote!</blockquote>

and you want to get this split by the <h2> tags so you end up with 3 (in this example) distinct blocks of HTML, like this:

first one


<div>Prelude</div>

second one


<h2>First Header</h2>

<p>Paragraph <b>here</b>.</p>
<p>Another paragraph.</p>

third one


<h2 id="second">Second Header</h2>

<ul>
  <li>One</li>
  <li>Two</li>
</ul>
<blockquote>End quote!</blockquote>

You could try to cast the regex spell on that and try to, I don't know, split the string by the </h2>. But it's risky and error prone because (although a bit unlikely in this simple example) get caught up in <h2>...</h2> tags that are nested inside something else. Also, proper parsing almost always wins in the long run over regexes.

Use cheerio

This is how I solved it and hopefully A) you can copy and benefit, or B) someone tells me there's already a much better way.

What you do is walk the DOM root nodes, one by one, and keep filling a buffer and then yield individual new cheerio instances.


const html = `
<div>Prelude</div>

<h2>First Header</h2>
<p>Paragraph <b>here</b>.</p>
<p>Another paragraph.</p>
<!-- comment -->

<h2 id="second">Second Header</h2>
<ul>
  <li>One</li>
  <li>Two</li>
</ul>
<blockquote>End quote!</blockquote>
`;

// load the raw HTML
// it needs to all be wrapped in *one* big wrapper
const $ = cheerio.load(`<div id="_body">${html}</div>`);

// the end goal
const blocks = [];

// the buffer
const section = cheerio
  .load("<div></div>", { decodeEntities: false })("div")
  .eq(0);

const iterable = [...$("#_body")[0].childNodes];
let c = 0;
iterable.forEach(child => {
  if (child.tagName === "h2") {
    if (c) {
      blocks.push(section.clone());
      section.empty();
      c = 0; // reset the counter
    }
  }
  c++;
  section.append(child);
});
if (c) {
  // stragglers
  blocks.push(section.clone());
}

// Test the result
const blocksAsStrings = blocks.map(block => block.html());
console.log(blocksAsStrings.length);
// 3
console.log(blocksAsStrings);
// [
//   '\n<div>Prelude</div>\n\n',
//   '<h2>First Header</h2>\n' +
//     '<p>Paragraph <b>here</b>.</p>\n' +
//     '<p>Another paragraph.</p>\n' +
//     '<!-- comment -->\n' +
//     '\n',
//   '<h2 id="second">Second Header</h2>\n' +
//     '<ul>\n' +
//     '  <li>One</li>\n' +
//     '  <li>Two</li>\n' +
//     '</ul>\n' +
//     '<blockquote>End quote!</blockquote>\n'
// ]

In this particular implementation the choice of splitting is by the every h2 tag. If you want to split by anything else, go ahead and adjust the conditional there where it's currently doing if (child.tagName === "h2") {.

Also, what you do with the blocks is up to you. Perhaps you need them as strings, then you use the blocks.map(block => block.html()). Otherwise, if it serves your needs they can remain as individual cheerio instances that you can do whatever with.

A Python and Preact app deployed on Heroku

December 13, 2019
2 comments Web development, Django, Python, Docker, JavaScript

Heroku is great but it's sometimes painful when your app isn't just in one single language. What I have is a project where the backend is Python (Django) and the frontend is JavaScript (Preact). The folder structure looks like this:

/
  - README.md
  - manage.py
  - requirements.txt
  - my_django_app/
     - settings.py
     - asgi.py
     - api/
        - urls.py
        - views.py
  - frontend/
     - package.json
     - yarn.lock
     - preact.config.js
     - build/
        ...
     - src/
        ...

A bunch of things omitted for brevity but people familiar with Django and preact-cli/create-create-app should be familiar.
The point is that the root is a Python app and the front-end is exclusively inside a sub folder.

When you do local development, you start two servers:

  • ./manage.py runserver - starts http://localhost:8000
  • cd frontend && yarn start - starts http://localhost:3000

The latter is what you open in your browser. That preact app will do things like:


const response = await fetch('/api/search');

and, in preact.config.js I have this:


export default (config, env, helpers) => {

  if (config.devServer) {
    config.devServer.proxy = [
      {
        path: "/api/**",
        target: "http://localhost:8000"
      }
    ];
  }

};

...which is hopefully self-explanatory. So, calls like GET http://localhost:3000/api/search actually goes to http://localhost:8000/api/search.

That's when doing development. The interesting thing is going into production.

Before we get into Heroku, let's first "merge" the two systems into one and the trick used is Whitenoise. Basically, Django's web server will be responsibly not only for things like /api/search but also static assets such as / --> frontend/build/index.html and /bundle.17ae4.js --> frontend/build/bundle.17ae4.js.

This is basically all you need in settings.py to make that happen:


MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",
    ...
]

WHITENOISE_INDEX_FILE = True

STATIC_URL = "/"
STATIC_ROOT = BASE_DIR / "frontend" / "build"

However, this isn't quite enough because the preact app uses preact-router which uses pushState() and other code-splitting magic so you might have a URL, that users see, like this: https://myapp.example.com/that/thing/special and there's nothing about that in any of the Django urls.py files. Nor is there any file called frontend/build/that/thing/special/index.html or something like that.
So for URLs like that, we have to take a gamble on the Django side and basically hope that the preact-router config knows how to deal with it. So, to make that happen with Whitenoise we need to write a custom middleware that looks like this:


from whitenoise.middleware import WhiteNoiseMiddleware


class CustomWhiteNoiseMiddleware(WhiteNoiseMiddleware):
    def process_request(self, request):
        if self.autorefresh:
            static_file = self.find_file(request.path_info)
        else:
            static_file = self.files.get(request.path_info)

            # These two lines is the magic.
            # Basically, the URL didn't lead to a file (e.g. `/manifest.json`)
            # it's either a API path or it's a custom browser path that only
            # makes sense within preact-router. If that's the case, we just don't
            # know but we'll give the client-side preact-router code the benefit
            # of the doubt and let it through.
            if not static_file and not request.path_info.startswith("/api"):
                static_file = self.files.get("/")

        if static_file is not None:
            return self.serve(static_file, request)

And in settings.py this change:


MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
-   "whitenoise.middleware.WhiteNoiseMiddleware",
+   "my_django_app.middleware.CustomWhiteNoiseMiddleware",
    ...
]

Now, all traffic goes through Django. Regular Django view functions, static assets, and everything else fall back to frontend/build/index.html.

Heroku

Heroku tries to make everything so simple for you. You basically, create the app (via the cli or the Heroku web app) and when you're ready you just do git push heroku master. However that won't be enough because there's more to this than Python.

Unfortunately, I didn't take notes of my hair-pulling excruciating journey of trying to add buildpacks and hacks and Procfiles and custom buildpacks. Nothing seemed to work. Perhaps the answer was somewhere in this issue: "Support running an app from a subdirectory" but I just couldn't figure it out. I still find buildpacks confusing when it's beyond Hello World. Also, I didn't want to run Node as a service, I just wanted it as part of the "build process".

Docker to the rescue

Finally I get a chance to try "Deploying with Docker" in Heroku which is a relatively new feature. And the only thing that scared me was that now I need to write a heroku.yml file which was confusing because all I had was a Dockerfile. We'll get back to that in a minute!

So here's how I made a Dockerfile that mixes Python and Node:


FROM node:12 as frontend

COPY . /app
WORKDIR /app
RUN cd frontend && yarn install && yarn build


FROM python:3.8-slim

WORKDIR /app

RUN groupadd --gid 10001 app && useradd -g app --uid 10001 --shell /usr/sbin/nologin app
RUN chown app:app /tmp

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    gcc apt-transport-https python-dev

# Gotta try moving this to poetry instead!
COPY ./requirements.txt /app/requirements.txt
RUN pip install --upgrade --no-cache-dir -r requirements.txt

COPY . /app
COPY --from=frontend /app/frontend/build /app/frontend/build

USER app

ENV PORT=8000
EXPOSE $PORT

CMD uvicorn gitbusy.asgi:application --host 0.0.0.0 --port $PORT

If you're not familiar with it, the critical trick is on the first line where it builds some Node with as frontend. That gives me a thing I can then copy from into the Python image with COPY --from=frontend /app/frontend/build /app/frontend/build.

Now, at the very end, it starts a uvicorn server with all the static .js, index.html, and favicon.ico etc. available to uvicorn which ultimately runs whitenoise.

To run and build:

docker build . -t my_app
docker run -t -i --rm --env-file .env -p 8000:8000 my_app

Now, opening http://localhost:8000/ is a production grade app that mixes Python (runtime) and JavaScript (static).

Heroku + Docker

Heroku says to create a heroku.yml file and that makes sense but what didn't make sense is why I would add cmd line in there when it's already in the Dockerfile. The solution is simple: omit it. Here's what my final heroku.yml file looks like:


build:
  docker:
    web: Dockerfile

Check in the heroku.yml file and git push heroku master and voila, it works!

To see a complete demo of all of this check out https://github.com/peterbe/gitbusy and https://gitbusy.herokuapp.com/

MDN Documents Size Tree Map

November 14, 2019
0 comments MDN, Web development

Recently I've been playing with the content of MDN as a whole. MDN has ~140k documents in its Wiki. About ~70k of them are redirects which is the result of many years of switching tech and switching information architecture and at the same time being good Internet citizens and avoiding 404s. So, out of the ~70k documents, how do they spread? To answer that I wrote a Python script that evaluates size as a matter of the sum of all the files in sub-trees including pictures.

Here are the screenshots:

All locales

All locales

Specifically en-US

Specifically en-US

The code that puts this together uses Toast UI which seems cool but I didn't spend much time worrying about how to use it.

Be warned! Opening this link will make your browser sweat: https://8mw9v.csb.app/

You can fork it here: https://codesandbox.io/s/zen-swirles-8mw9v

Avoid async when all you have is (SSD) disk I/O in NodeJS

October 24, 2019
1 comment Node, JavaScript

tl;dr; If you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

I'm not a NodeJS expert so I could really do with some eyes on this.

There is little doubt in my mind that it's smart to use asynchronous ideas when your program has to wait for network I/O. Because network I/O is slow, it's better to let your program work on something else whilst waiting. But disk is actually fast. Especially if you have SSD disk.

The context

I'm working on a Node program that walks a large directory structure and looks for certain file patterns, reads those files, does some processing and then exits. It's a cli basically and it's supposed to work similar to jest where you tell it to go and process files and if everything worked, exit with 0 and if anything failed, exit with something >0. Also, it needs to be possible to run it so that it exits immediately on the first error encountered. This is similar to running jest --bail.

My program needs to process thousands of files and although there are thousands of files, they're all relatively small. So first I wrote a simple reference program: https://github.com/peterbe/megafileprocessing/blob/master/reference.js
What it does is that it walks a directory looking for certain .json files that have certain keys that it knows about. Then, just computes the size of the values and tallies that up. My real program will be very similar except it does a lot more with each .json file.

You run it like this:


▶ CHAOS_MONKEY=0.001 node reference.js ~/stumptown-content/kumadocs -q
Error: Chaos Monkey!
    at processDoc (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:37:11)
    at /Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:80:21
    at Array.forEach (<anonymous>)
    at main (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:78:9)
    at Object.<anonymous> (/Users/peterbe/dev/JAVASCRIPT/megafileprocessing/reference.js:99:20)
    at Module._compile (internal/modules/cjs/loader.js:956:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:973:10)
    at Module.load (internal/modules/cjs/loader.js:812:32)
    at Function.Module._load (internal/modules/cjs/loader.js:724:14)
    at Function.Module.runMain (internal/modules/cjs/loader.js:1025:10)
Total length for 4057 files is 153953645
1 files failed.

(The environment variable CHAOS_MONKEY=0.001 makes it so there's a 0.1% chance it throws an error)

It processed 4,057 files and one of those failed (thanks to the "chaos monkey").
In its current state that (on my MacBook) that takes about 1 second.

It's not perfect but it's a good skeleton. Everything is synchronous. E.g.


function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;
  files.forEach(file => {
    try {
      const total = processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  });
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

And inside the processDoc function it used const content = fs.readFileSync(fspath, "utf8");.

I/Os compared

@amejiarosario has a great blog post called "What every programmer should know about Synchronous vs. Asynchronous Code". In it, he has this great bar chart:

Latency vs. System Event

If you compare "SSD I/O" with "Network SFO/NCY" the difference is that SSD I/O is 456 times "faster" than SFO-to-NYC network I/O. I.e. the latency is 456 times less.

Another important aspect when processing lots of files is garbage collection. When running synchronous, it can garbage collect as soon as it has processed one file before moving on to the next. If it was asynchronous, as soon as it yields to move on to the next file, it might hold on to memory from the first file. Why does this matter? Because if the memory-usage when processing many files asynchronously bloat so hard that it actually crashes with an out-of-memory error. So what matters is avoiding that. It's OK if the program can use lots of memory if it needs to, but it's really bad if it crashes.

One way to measure this is to use /usr/bin/time -l (at least that's what it's called on macOS). For example:

▶ /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        0.75 real         0.58 user         0.23 sys
  57221120  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     64160  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
      1074  involuntary context switches

Its maximum memory usage total was 57221120 bytes (55MB) in this example.

Introduce asynchronous file reading

Let's change the reference implementation to use const content = await fsPromises.readFile(fspath, "utf8");. We're still using files.forEach(file => { but within the loop the whole function is prefixed with async function main() { now. Like this:


async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let total;
  for (let file of files) {
    try {
      total = await processDoc(file, quiet);
      !quiet && console.log(`${file} is ${total}`);
      totalTotal += total;
    } catch (err) {
      if (bail) {
        throw err;
      } else {
        console.error(err);
        errors++;
      }
    }
  }
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
  }
  return errors ? 1 : 0;
}

Let's see how it works:

▶ /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
Total length for 4057 files is 153970749
        1.31 real         1.01 user         0.49 sys
  68898816  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
     68107  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     62562  involuntary context switches

That means it maxed out at 68898816 bytes (65MB).

You can already see a difference. 0.79 seconds and 55MB for synchronous and 1.31 seconds and 65MB for asynchronous.

But to really measure this, I wrote a simple Python program that runs this repeatedly and reports a min/median on time and max on memory:

▶ python3 wrap_time.py /usr/bin/time -l node reference.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   0.74s
WORST:  0.84s
MEAN:   0.78s
MEDIAN: 0.78s
MAX MEMORY
BEST:   53.5MB
WORST:  55.3MB
MEAN:   54.6MB
MEDIAN: 54.8MB

And for the asynchronous version:

▶ python3 wrap_time.py /usr/bin/time -l node async1.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.28s
WORST:  1.82s
MEAN:   1.39s
MEDIAN: 1.31s
MAX MEMORY
BEST:   65.4MB
WORST:  67.7MB
MEAN:   66.7MB
MEDIAN: 66.9MB

Promise.all version

I don't know if the async1.js is realistic. More realistically you'll want to not wait for one file to be processed (asynchronously) but start them all at the same time. So I made a variation of the asynchronous version that looks like this instead:


async function main(args) {
  // By default, don't exit if any error happens
  const { bail, quiet, root } = parseArgs(args);
  const files = walk(root, ".json");
  let totalTotal = 0;
  let errors = 0;

  let values;
  values = await Promise.all(
    files.map(async file => {
      try {
        total = await processDoc(file, quiet);
        !quiet && console.log(`${file} is ${total}`);
        return total;
      } catch (err) {
        if (bail) {
          console.error(err);
          process.exit(1);
        } else {
          console.error(err);
          errors++;
        }
      }
    })
  );
  totalTotal = values.filter(n => n).reduce((a, b) => a + b);
  console.log(`Total length for ${files.length} files is ${totalTotal}`);
  if (errors) {
    console.warn(`${errors} files failed.`);
    throw new Error("More than 0 errors");
  }
}

You can see the whole file here: async2.js

The key difference is that it uses await Promise.all(files.map(...)) instead of for (let file of files) {.
Also, to accomplish the ability to bail on the first possible error it needs to use process.exit(1); within the callbacks. Not sure if that's right but from the outside, you get the desired effect as a cli program. Let's measure it too:

▶ python3 wrap_time.py /usr/bin/time -l node async2.js ~/stumptown-content/kumadocs -q
...
TIMES
BEST:   1.44s
WORST:  1.61s
MEAN:   1.52s
MEDIAN: 1.52s
MAX MEMORY
BEST:   434.0MB
WORST:  460.2MB
MEAN:   453.4MB
MEDIAN: 456.4MB

Note how this uses almost 10x max. memory. That's dangerous if the processing is really memory hungry individually.

When asynchronous is right

In all of this, I'm assuming that the individual files are small. (Roughly, each file in my experiment is about 50KB)
What if the files it needs to read from disk are large?

As a simple experiment read /users/peterbe/Downloads/Keybase.dmg 20 times and just report its size:


for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

See the simple-async.js here. Basically it's this:


for (let x = 0; x < 20; x++) {
  fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
    if (err) throw err;
    console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
  });
}

Results are:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async.js
...
TIMES
BEST:   0.84s
WORST:  4.32s
MEAN:   1.33s
MEDIAN: 0.97s
MAX MEMORY
BEST:   1851.1MB
WORST:  3079.3MB
MEAN:   2956.3MB
MEDIAN: 3079.1MB

And the equivalent synchronous simple-sync.js here.


for (let x = 0; x < 20; x++) {
  const largeFile = fs.readFileSync("/users/peterbe/Downloads/Keybase.dmg");
  console.log(`File size#${x}: ${Math.round(largeFile.length / 1e6)} MB`);
}

It performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-sync.js
...
TIMES
BEST:   1.97s
WORST:  2.74s
MEAN:   2.27s
MEDIAN: 2.18s
MAX MEMORY
BEST:   1089.2MB
WORST:  1089.7MB
MEAN:   1089.5MB
MEDIAN: 1089.5MB

So, almost 2x as slow but 3x as much max. memory.

Lastly, instead of an iterative loop, let's start 20 readers at the same time (simple-async2.js):


Promise.all(
  [...Array(20).fill()].map((_, x) => {
    return fs.readFile("/users/peterbe/Downloads/Keybase.dmg", (err, data) => {
      if (err) throw err;
      console.log(`File size#${x}: ${Math.round(data.length / 1e6)} MB`);
    });
  })
);

And it performs like this:

▶ python3 wrap_time.py /usr/bin/time -l node simple-async2.js
...
TIMES
BEST:   0.86s
WORST:  1.09s
MEAN:   0.96s
MEDIAN: 0.94s
MAX MEMORY
BEST:   3079.0MB
WORST:  3079.4MB
MEAN:   3079.2MB
MEDIAN: 3079.2MB

So quite naturally, the same total time as the simple async version but uses 3x max. memory every time.

Ergonomics

I'm starting to get pretty comfortable with using promises and async/await. But I definitely feel more comfortable without. Synchronous programs read better from an ergonomics point of view. The async/await stuff is just Promises under the hood and it's definitely an improvement but the synchronous versions just have a simpler "feeling" to it.

Conclusion

I don't think it's a surprise that the overhead of event switching adds more time than its worth when the individual waits aren't too painful.

A major flaw with synchronous programs is that they rely on the assumption that there's no really slow I/O. So what if the program grows and morphs so that it someday does depend on network I/O then your synchronous program is "screwed" since an asynchronous version would run circles around it.

The general conclusion is; if you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.

Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON

September 30, 2019
2 comments Redis, Nginx, Web Performance, Python, Django, PostgreSQL

Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.

However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.

But is it entirely wrong to look at WHICH database to get the best speed?

Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx

Nginx??

The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:


const response = await fetch('https://songsear.ch/api/song/5246889/amazon');

Internally, in the app, what it does is that it looks this up, by ID, on the AmazonAffiliateLookup ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as ProductTypeName, Studio, TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.

Again, Nginx?

Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.

Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:

▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes

Summary:
  Total:    0.9882 secs
  Slowest:  0.0279 secs
  Fastest:  0.0001 secs
  Average:  0.0010 secs
  Requests/sec: 10119.8265


Response time histogram:
  0.000 [1] |
  0.003 [9752]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [108]   |
  0.008 [70]    |
  0.011 [32]    |
  0.014 [8] |
  0.017 [12]    |
  0.020 [11]    |
  0.022 [1] |
  0.025 [4] |
  0.028 [1] |


Latency distribution:
  10% in 0.0003 secs
  25% in 0.0006 secs
  50% in 0.0008 secs
  75% in 0.0010 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0068 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0000 secs, 0.0001 secs, 0.0279 secs
  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0026 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:    0.0008 secs, 0.0001 secs, 0.0206 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200] 10000 responses

10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a Cache-Control HTTP header with something like max-age=86400.

Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).

Conclusion

If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.

How much faster is Redis at storing a blob of JSON compared to PostgreSQL?

September 28, 2019
65 comments Python, PostgreSQL, Redis

tl;dr; Redis is 16 times faster at reading these JSON blobs.

In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?

Example view of affiliate links

The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:


class AmazonAffiliateLookup(models.Model, TotalCountMixin):
    song = models.ForeignKey(Song, on_delete=models.CASCADE)
    matches = JSONField(null=True)
    search_index = models.CharField(max_length=100, null=True)
    lookup_seconds = models.FloatField(null=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
    modified = models.DateTimeField(auto_now=True)

At the moment this database table is 3GB on disk.

Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.

The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:


CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": REDIS_URL,
        "TIMEOUT": config("CACHE_TIMEOUT", 500),
        "KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
        "OPTIONS": {
            "COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
            "SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
        },
    }
}

The speed difference

Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost.

Reads

The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.

I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:

PG:
median: 8.66ms
mean  : 11.18ms
stdev : 19.48ms

Redis:
median: 0.53ms
mean  : 0.84ms
stdev : 2.26ms

(310 measurements)

It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.

Writes

The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:

PG:
median: 8.59ms
mean  : 8.58ms
stdev : 6.78ms

Redis:
median: 0.44ms
mean  : 0.49ms
stdev : 0.27ms

(137 measurements)

It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.

Conclusion and discussion

First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.

Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.

I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.

UPDATE September 29, 2019

I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:


+from dataclasses import dataclass


+@dataclass
+class _Lookup:
+    modified: datetime.datetime
+    matches: list

...

+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)

print(lookup.modified)

Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.

uwsgi weirdness with --http

September 19, 2019
2 comments Python, Linux

Instead of upgrading everything on my server, I'm just starting from scratch. From Ubuntu 16.04 to Ubuntu 19.04 and I also upgraded everything else in sight. One of them was uwsgi. I copied various user config files but for uwsgi things didn't very well. On the old server I had uwsgi version 2.0.12-debian and on the new one 2.0.18-debian. The uWSGI changelog is pretty hard to read but I sure don't see any mention of this.

You see, on SongSearch I have it so that Nginx talks to Django via a uWSGI socket. But the NodeJS server talks to Django via 127.0.0.1:PORT. So I need my uWSGI config to start both. Here was the old config:

[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
http = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

(The only difference on the new server was the python37 plugin instead)

I start it and everything looks fine. No errors in the log files. And netstat looks like this:

# netstat -ntpl | grep 9090
tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      1855/uwsgi

But every time I try to curl localhost:9090 I kept getting curl: (52) Empty reply from server. Nothing in the log files! It seemed no matter what I tried I just couldn't talk to it over HTTP. No, I'm not a sysadmin. I'm just a hobbyist trying to stand up my little server with the tools and limited techniques I know but I was stumped.

The solution

After endless Googling for a resolution and trying all sorts of uwsgi commands directly, I somehow stumbled on the solution.


[uwsgi]
plugins = python35
virtualenv = /var/lib/django/songsearch/venv
pythonpath = /var/lib/django/songsearch
user = django
uid = django
master = true
processes = 3
enable-threads = true
touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
-http = 127.0.0.1:9090
+http-socket = 127.0.0.1:9090
module = songsearch.wsgi:application
env = LANG=en_US.utf8
env = LC_ALL=en_US.UTF-8
env = LC_LANG=en_US.UTF-8

With this one subtle change, I can now curl localhost:9090 and I still have the /var/run/uwsgi/app/songsearch/socket socket. So, yay!

I'm blogging about this in case someone else ever gets stuck in the same nasty surprise as me.

Also, I have to admit, I was fuming with rage from this frustration. It's really inspired me to revive the quest for an alternative to uwsgi because I'm not sure it's that great anymore. There are new alternatives such as gunicorn, gunicorn with Meinheld, bjoern etc.

Fastest Python function to slugify a string

September 12, 2019
4 comments Python

In MDN I noticed a function that turns a piece of text (Python 2 unicode) into a slug. It looks like this:


    non_url_safe = ['"', '#', '$', '%', '&', '+',
                    ',', '/', ':', ';', '=', '?',
                    '@', '[', '\\', ']', '^', '`',
                    '{', '|', '}', '~', "'"]

    def slugify(self, text):
        """
        Turn the text content of a header into a slug for use in an ID
        """
        non_safe = [c for c in text if c in self.non_url_safe]
        if non_safe:
            for c in non_safe:
                text = text.replace(c, '')
        # Strip leading, trailing and multiple whitespace, convert remaining whitespace to _
        text = u'_'.join(text.split())
        return text

The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.

I couldn't help but to react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.

The candidates


translate_table = {ord(char): u'' for char in non_url_safe}
non_url_safe_regex = re.compile(
    r'[{}]'.format(''.join(re.escape(x) for x in non_url_safe)))


def _slugify1(self, text):
    non_safe = [c for c in text if c in self.non_url_safe]
    if non_safe:
        for c in non_safe:
            text = text.replace(c, '')
    text = u'_'.join(text.split())
    return text

def _slugify2(self, text):
    text = text.translate(self.translate_table)
    text = u'_'.join(text.split())
    return text

def _slugify3(self, text):
    text = self.non_url_safe_regex.sub('', text).strip()
    text = u'_'.join(re.split(r'\s+', text))
    return text

I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.

The results

The slowest is fast enough. But if you're still reading, here are the results:

_slugify1 0.101ms
_slugify2 0.019ms
_slugify3 0.033ms

So using a translate table is 5 times faster. And a regex 3 times faster. But they're all sufficiently fast.

Conclusion

This is the least of your problems in a world of real I/O such as databases and other genuinely CPU intense stuff. Well, it was fun little side-trip.

Also, aren't there better solutions that just blacklist all control characters?

NodeJS fs walk() or glob or fast-glob

August 31, 2019
3 comments JavaScript

It started with this:


function walk(directory, filepaths = []) {
    const files = fs.readdirSync(directory);
    for (let filename of files) {
        const filepath = path.join(directory, filename);
        if (fs.statSync(filepath).isDirectory()) {
            walk(filepath, filepaths);
        } else if (path.extname(filename) === '.md') {
            filepaths.push(filepath);
        }
    }
    return filepaths;
}

And you use it like this:


const foundFiles = walk(someDirectoryOfMine);
console.log(foundFiles.length);

I thought, perhaps it's faster or better to use glob. So I installed that.
Then I found, fast-glob which sounds faster. You use both in a synchronous way.

I have a directory with about 450 files, of which 320 of them are .md files. Let's compare:

walk: 10.212ms
glob: 37.492ms
fg: 14.200ms

I measured it using console.time like this:


console.time('walk');
const foundFiles = walk(someDirectoryOfMine);
console.timeEnd('walk');
console.log(foundFiles.length);

I suppose those packages have other fancier features but, I guess this just goes to show, keep it simple.

UPDATE June 2021

The origins of this blog post were that I need a simple function to find files on disk. Later, the requirements became a bit more complex so I needed something a bit more advanced. In shopping around I found fdir which, from testing, performed excellently and has a great API (and documentation). I would handsdown use that again.