Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.
This blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files, which together span about 1,000,000 URLs. The context here is Python, and the data comes from Django. Python is pretty key here, but if you have something other than Django, you can squint and mentally replace that with your own data mapper.
Generate the sitemap .xml.gz file(s)
Here's the core of the work: a generator function that takes a Django QuerySet instance (that is ordered and filtered!), builds etree trees in batches, and dumps each one to disk with gzip.
import gzip

from lxml import etree

outfile = "sitemap-{start}-{end}.xml.gz"
batchsize = 40_000


def generate(qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None
    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        # make_song_url() is defined elsewhere (not shown here)
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1
    if filename:  # skip if the queryset was empty
        close(root, filename)
        yield filename
The most important lines in terms of lxml.etree and sitemaps are:
root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
Another important thing is the note about using .values(). If you don't do that, Django will create a model instance for every single row the iterator returns. That's expensive. See this blog post.
Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.
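If your data doesn't come from Django, the same batching pattern works with any iterable of URLs. The following is a minimal sketch, not the Song Search code: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra dependencies, and the names write_sitemaps and out_dir are made up for illustration.

```python
import gzip
import os
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def write_sitemaps(urls, out_dir, batchsize=40_000):
    """Yield the path of each gzipped sitemap file once it has been written.

    `urls` is any iterable of absolute URL strings (hypothetical input).
    """

    def flush(root, start):
        # Same naming scheme as above: sitemap-{start}-{start + batchsize}.xml.gz
        name = "sitemap-{}-{}.xml.gz".format(start, start + batchsize)
        path = os.path.join(out_dir, name)
        with gzip.open(path, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(ET.tostring(root, encoding="utf-8"))
        return path

    root, start, count = None, 0, 0
    for url in urls:
        if root is None:  # first URL of a new batch
            root = ET.Element("urlset", xmlns=SITEMAP_NS)
            start = count
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
        count += 1
        if count % batchsize == 0:  # batch is full, write it out
            yield flush(root, start)
            root = None
    if root is not None:  # write the final, partially filled batch
        yield flush(root, start)
```

Because it's a generator, nothing is written until you iterate over it, e.g. `for path in write_sitemaps(urls, "/tmp"): print(path)`.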
Generate the map of sitemaps
The map of maps doesn't need to be gzipped since it's going to be tiny.
import datetime
import os
from glob import glob

from lxml import etree


def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        # Pick up every gzipped sitemap in the current working directory
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            # Use the file's modification time as <lastmod>
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))
And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.
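To sanity-check the result, you can read the index back and list what it points to. This is a sketch using only the standard library; read_sitemap_index is a hypothetical helper name, and it assumes the index uses the standard sitemaps.org namespace, as in the code above.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_index(path):
    """Return a list of (loc, lastmod) tuples found in a sitemap index file."""
    root = ET.parse(path).getroot()
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)  # None if absent
        entries.append((loc, lastmod))
    return entries
```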
Bonus and Thoughts
The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging fruit fixes. I know, because I tried. I experimented with not using pretty_print=True and I experimented with not writing with gzip.open and instead gzipping the files later. Nothing made any significant difference. The lxml.etree part of this, in terms of performance, is marginal by an order of magnitude compared to the cost of actually getting the data out of the database plus later writing to disk. I also experimented with generating the gzip content with zopfli and it didn't make much of a difference.
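If you want to repeat that kind of experiment, a rough way is to time the building, the serialization and the gzip step separately. This sketch is not how the numbers above were measured: it uses the standard library's ElementTree rather than lxml, fake URLs, and a hypothetical time_steps helper, so treat its output as indicative only.

```python
import gzip
import io
import time
import xml.etree.ElementTree as ET


def time_steps(n_urls=40_000):
    """Return seconds spent building, serializing and gzipping one sitemap."""
    timings = {}

    t0 = time.perf_counter()
    root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for i in range(n_urls):
        loc = "https://example.com/song/{}".format(i)  # made-up URL pattern
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc
    timings["build"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    payload = ET.tostring(root, encoding="utf-8")
    timings["serialize"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    buffer = io.BytesIO()  # compress in memory to leave disk speed out of it
    with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
        f.write(payload)
    timings["gzip"] = time.perf_counter() - t0

    timings["compressed_bytes"] = len(buffer.getvalue())
    return timings
```

Comparing these numbers with the time your database query takes on its own should tell you where the seconds actually go.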
I originally wrote this code years ago and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000 so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000 but perhaps not important.
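For what it's worth, the sitemaps.org protocol caps each sitemap file at 50,000 URLs and 50MB uncompressed, so a batch size of 40,000 sits comfortably below the limit. Here is a small sketch for checking a generated file against those limits. check_sitemap is a hypothetical helper, and it reads the whole file into memory, which should be fine at these sizes.

```python
import gzip
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def check_sitemap(path):
    """Return (url_count, uncompressed_bytes, within_limits) for a .xml.gz sitemap."""
    with gzip.open(path, "rb") as f:
        payload = f.read()
    root = ET.fromstring(payload)
    url_count = sum(1 for _ in root.iter(LOC_TAG))
    within_limits = url_count <= MAX_URLS and len(payload) <= MAX_BYTES
    return url_count, len(payload), within_limits
```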