Build an XML sitemap of XML sitemaps

Saturday, Jun 1, 2019
0 comments Python, Django

Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.

The blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files which spans about 1,000,000 URLs. The context here is Python and the getting of the data is from Django. Python is pretty key here but if you have something other than Django, you can squint and mentally replace that with your own data mapper.

Generate the sitemap `.xml.gz` file(s)

Here's the core of the work. A generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.


import gzip

from lxml import etree


outfile = "sitemap-{start}-{end}.xml"
batchsize = 40_000


def generate(self, qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None

    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    close(root, filename)
    yield filename

The most important lines in terms of lxml.etree and sitemaps are:


root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...         
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc

Another important thing is the note about using .values(). If you don't do that Django will create a model instance for every single row it returns of the iterator. That's expensive. See this blog post.

Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.

Generate the map of sitemaps

Making the map of maps doesn't need to be gzipped since it's going to be tiny.


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))

And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.

Bonus and Thoughts

The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging fruit fixes. I know, because I tried. I experimented with not using pretty_print=True and I experimented with not writing with gzip.open and instead gzipping the files on later. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is order of maginitude marginal in comparison to the cost of actually getting the data out of the database plus later writing to disk. I also experimenting with generating the gzip content with zopfli and it didn't make much of a difference.

I originally wrote this code years ago and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000 so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000 but perhaps not important.

Comments

Previous:: Generate a random IP address in Python June 1, 2019 Python, Django
Next:: From jQuery to Cash June 18, 2019 Web development, JavaScript

Related by category:: Native connection pooling in Django 5 with PostgreSQL June 25, 2025 Django, Python; A Python dict that can report which keys you did not use June 12, 2025 Python; How I run standalone Python in 2025 January 14, 2025 Python; How to resolve a git conflict in poetry.lock February 7, 2020 Python

Related by keyword:: Fastest way to turn HTML into text in Python January 8, 2021 Python; How much faster is Redis at storing a blob of JSON compared to PostgreSQL? September 28, 2019 Python, PostgreSQL, Redis; mincss "Clears the junk out of your CSS" January 21, 2013 Python, Web development; UPPER vs. ILIKE April 19, 2010 Web development

Build an XML sitemap of XML sitemaps

Generate the sitemap .xml.gz file(s)

Generate the map of sitemaps

Bonus and Thoughts

Comments

Related posts

Generate the sitemap `.xml.gz` file(s)