So the context is this: a zip file is uploaded into a web service, and Python then needs to extract it and analyze and deal with each file within. In this particular application, it looks at each file's name and size, compares that to what has already been uploaded to AWS S3, and if the file is believed to be different or new, it gets uploaded to AWS S3.
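
In rough terms, that comparison looks something like this (a sketch with boto3; the bucket name and helper function are made up for illustration, and the real logic is more involved):

from botocore.exceptions import ClientError
import boto3

s3 = boto3.client('s3')
BUCKET_NAME = 'my-uploads-bucket'  # hypothetical; not the real bucket


def needs_upload(key, size):
    """Return True if the key isn't in S3 yet or its size differs."""
    try:
        response = s3.head_object(Bucket=BUCKET_NAME, Key=key)
    except ClientError as exception:
        if exception.response['Error']['Code'] == '404':
            return True  # not uploaded yet
        raise
    return response['ContentLength'] != size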

Uploads today
The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them there are mostly plain text files, but there are also some binary files that are huge. It's not unusual for each zip file to contain 100 files, with 1-3 of them making up 95% of the zip file's size.

At first I tried unzipping the file in memory and dealing with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. I guess it makes sense: first you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files. This worked much better, but I still noticed the whole unzipping was taking a huge amount of time. Is there perhaps a way to optimize that?
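
The dump-to-disk step is roughly this (a sketch, assuming the upload arrives as a file-like object):

import os
import shutil
import tempfile


def save_upload_to_disk(upload):
    """Write the uploaded zip (a file-like object) into /tmp and return
    the path, so it can be extracted from disk instead of from RAM."""
    tmp_dir = tempfile.mkdtemp(dir='/tmp')
    zip_path = os.path.join(tmp_dir, 'upload.zip')
    with open(zip_path, 'wb') as destination:
        # Copy in chunks so the whole upload never sits in memory twice.
        shutil.copyfileobj(upload, destination, length=1024 * 1024)
    return zip_path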

Baseline function

First, here are the common functions that simulate actually doing something with each file in the zip file:


def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)


def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory 
    # inefficient and unrealistic in terms of this 
    # benchmark experiment. 
    total = 0
    for line in f:
        total += len(line)
    return total

Here's the simplest one possible:


import os
import zipfile


def f1(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            fn = os.path.join(root, file_)
            total += _count_file(fn)
    return total

If I analyze it a bit more carefully, I find that it spends about 40% of its time doing the extractall and 60% looping over the files and reading their full lengths.
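
One way to get that kind of split is to wrap the two phases in timers (a rough sketch, reusing the imports and _count_file from above rather than a proper profiler):

import time


def f1_timed(fn, dest):
    t0 = time.perf_counter()
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)
    t1 = time.perf_counter()

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            total += _count_file(os.path.join(root, file_))
    t2 = time.perf_counter()

    print('extractall: %.1fs  counting: %.1fs' % (t1 - t0, t2 - t1))
    return total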

First attempt

My first attempt was to use threads. You create an instance of zipfile.ZipFile, extract every file name within, and start a thread for each name. Each thread is given a function that does the "meat of the work" (in this benchmark, iterating over the file and getting its total size). In reality that function does a bunch of complicated S3, Redis and PostgreSQL stuff, but in my benchmark I just made it a function that figures out the total length of the file. The thread pool function:


import concurrent.futures


def f2(fn, dest):

    def unzip_member(zf, member, dest):
        zf.extract(member, dest)
        fn = os.path.join(dest, member.filename)
        return _count_file(fn)

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member,
                        zf,
                        member,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~10% faster

Second attempt

So perhaps the GIL is blocking me. The natural inclination is to try multiprocessing to spread the work across multiple available CPUs. But doing so has the disadvantage that you can't pass around non-pickleable objects, so you have to send just the zip file's path and the member's filename to each future function:


def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)



def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~300% faster

That's cheating!

The problem with using a pool of processes is that it requires the original .zip file to exist on disk. So in my web server, to use this solution, I'd first have to save the in-memory zip file to disk, then invoke this function. I'm not sure what the cost of that is, but it's not likely to be cheap.

Well, it doesn't hurt to poke around. Perhaps it could be worth it if the extraction were significantly faster.

But remember! This optimization depends on using up as many CPUs as it possibly can. What if some of those CPUs are needed for something else going on in gunicorn? Those other processes would have to patiently wait until there's a CPU available. Since there are other things going on in this server, I'm not sure I'm willing to let one process take over all the other CPUs.

Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between f1 and f2! With concurrent.futures pool classes you can cap the number of CPUs it's allowed to use, but that doesn't feel great either. What if you get the number wrong in a virtual environment? Or if the number is too low, you don't benefit from spreading the workload and you're just paying overhead to move the work around?
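
For the record, capping it is just a matter of passing max_workers (a sketch based on f3 above; the 2 is arbitrary, not a recommendation):

def f3_capped(fn, dest, max_workers=2):
    # Same as f3, but explicitly capped so it can't hog every CPU.
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
            futures = [
                executor.submit(unzip_member_f3, fn, member.filename, dest)
                for member in zf.infolist()
            ]
            return sum(
                future.result()
                for future in concurrent.futures.as_completed(futures)
            )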

I'm going to stick with zipfile.ZipFile(file_buffer).extractall(temp_dir). It's good enough for this.

Want to try your hand at it?

I did my benchmarking using a c5.4xlarge EC2 server. The files can be downloaded from:

wget https://www.peterbe.com/unzip-in-parallel/hack.unzip-in-parallel.py
wget https://www.peterbe.com/unzip-in-parallel/symbols-2017-11-27T14_15_30.zip

The .zip file there is 34MB which is relatively small compared to what's happening on the server.

The hack.unzip-in-parallel.py is a hot mess. It contains a bunch of terrible hacks and ugly stuff but hopefully it's a start.

Comments

Anonymous

Consider using bzip2, xz or something similar which can be both compressed and decompressed in parallel. See pbzip2 and pixz. (Zip does not lend itself to this, so pigz would not help you too much under most circumstances.)

Choose the right UNIX tool: subprocesses and pipes are the way to go here, and pipes can go across network boundaries as well. Rely on the deep history of computing.
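
For instance, streaming one member out of the archive through a pipe with the stock unzip tool looks roughly like this (just a sketch, assuming Info-ZIP's unzip is on the PATH):

import subprocess


def count_member_via_pipe(zip_path, member_name):
    # 'unzip -p' writes the decompressed member to stdout; read it in chunks.
    proc = subprocess.Popen(
        ['unzip', '-p', zip_path, member_name],
        stdout=subprocess.PIPE,
    )
    total = 0
    for chunk in iter(lambda: proc.stdout.read(1 << 15), b''):
        total += len(chunk)
    proc.wait()
    return total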

Peter Bengtsson

Zip files can be decompressed in parallel too. You first ask the zip file for its file contents (fast), then you start parallel processes (or threads or whatever) that each get a file name to extract.

My multiprocess example above is a proof of that.

Martin Bammer

You should not use ThreadPoolExecutor, because it has a relatively high overhead and is very slow, especially if you use submit. Use a different thread pool implementation, or if you want to stick with ThreadPoolExecutor, use map. A much faster solution is the good old multiprocessing.pool.ThreadPool. Or you could also try my fastthreadpool module (https://github.com/brmmm3/fastthreadpool). This module is faster than ThreadPool and much faster than ThreadPoolExecutor. It also has the advantage that you can use generator functions as workers, which is very useful in certain situations. For example code, please look at benchmark.py. The doc directory contains some benchmark files which show the overhead difference between the 3 thread pool implementations. I'm also working on a fastprocesspool module, but this is still unfinished and buggy. First tests have shown that this module is also faster than multiprocessing.Pool and much faster than ProcessPoolExecutor.

Peter Bengtsson

What's the alternative to submit?
And multiprocessing.Pool might be marginally faster, but isn't the bulk of the computation still in the actual unzipping?

Martin Bammer

The bulk is indeed in unzipping. But if you have an archive with many small files, the overhead of the pool can be 10% or more, and that is a lot just for handling a pool of threads. The alternative is to use map, where you have to prepare an iterable before calling map. Another alternative is to switch to a faster pool implementation.
The module zipfile is completely written in Python, which comes with a relatively big overhead at the Python level, which in turn means the GIL is held for relatively long. The result is a low level of parallelisation. I'm currently writing an archiving tool which uses msgpack and zstd. Both libraries have a very thin Python layer and parallelisation with threads is very good; I get nearly 100% CPU load. The results currently are ~4x faster than zip, with a compression ratio between zip and lzma. When the tool is finished I'll release it to the public.
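
For instance, the submit loop in f2 above could use map instead, roughly like this (untested sketch):

def f2_map(fn, dest):

    def unzip_member(member):
        zf.extract(member, dest)
        return _count_file(os.path.join(dest, member.filename))

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        with concurrent.futures.ThreadPoolExecutor() as executor:
            # map returns results in order; sum consumes them lazily.
            return sum(executor.map(unzip_member, zf.infolist()))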

Peter Bengtsson

If you have an example of using fastthreadpool that I can use instead of concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor then I'll try to run a benchmark with that too.

Masklinn

Peter, if you don't need the entire content of the file in memory (which I guess you don't since you're apparently fine with reading it from disk despite in-memory decompression blowing up) *and* you don't need seekability, you may have missed the `open` method on zipfiles. You'll still be paying for the decompression work, and as noted previously you can only access data sequentially (no seeking backwards) but it *is* lazy, you ask for a 10k chunk of a file, you get 10k in memory (plus some overhead from Python).

Peter Bengtsson

Doing an open requires all things to be in-memory. It's a bit faster but would require me to have a much beefier server, which might be the best course of action actually.

Masklinn

> Doing an open requires all things to be in-memory.

It only requires that the zip file be accessible, either in-memory or on-disk.

Your introduction says that you originally had the zip file *and* the decompressed files in memory, ZipFile.open avoids the latter. Furthermore the small bit of code you post in the conclusion seems to hint that you still have the zip file in memory when you do the extraction.

And again, you can use ZipFile.open just fine if you have the zip file on-disk, it's still lazy.
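
Something along these lines (a sketch; only one ~32KB chunk of decompressed data is in memory at a time):

import zipfile


def count_member_lazily(zip_path, member_name):
    total = 0
    with zipfile.ZipFile(zip_path) as zf:
        # zf.open() streams the member; no extractall, no temp files.
        with zf.open(member_name) as member:
            for chunk in iter(lambda: member.read(1 << 15), b''):
                total += len(chunk)
    return total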

jfs

It might be more efficient to count bytes by reading chunks instead of lines: `_count_file_object = lambda file: sum(map(len, iter(lambda: file.read(1<<15), b'')))` (assuming you don't want `os.path.getsize(filename)` or similar). Related: https://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python
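
Spelled out, that one-liner is equivalent to something like:

def _count_file_object(f):
    # Read 32KB chunks instead of iterating line by line.
    total = 0
    for chunk in iter(lambda: f.read(1 << 15), b''):
        total += len(chunk)
    return total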

mike

awesome! thanks a lot!

Anonymous

Peter, can we make similar improvements with ZipFile.write() if it is required to zip folders and files in parallel?
Thanks.

Anonymous

Thanks for this code. I have something similar based on it running in production and I hit a wall.
Decompressing archives containing 65688 files turns out to be very, very slow.

To have a significant speed boost you can try something similar to this https://github.com/ITISFoundation/osparc-simcore/blob/cd9640c0072fcd2536a0ae11cd602e7b7a9ea3ee/packages/service-library/src/servicelib/archiving_utils.py#L45

It basically avoids reading the entire file list from the zipfile inside the background process!

Viola

How should I increase the instance number or thread number? Because I have many zip files to extract.

Peter Bengtsson

From https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
"If max_workers is None or not given, it will default to the number of processors on the machine."

So by default, it will use as many CPUs as your computer possibly has.
