Fastest way to unzip a zip file in Python

31 January 2018   10 comments   Python

So the context is this: a zip file is uploaded into a web service and Python then needs to extract it and analyze and deal with each file within. In this particular application, it looks at each file's name and size, compares that to what has already been uploaded to AWS S3, and if the file is believed to be different or new, it gets uploaded to AWS S3.

[Figure: Uploads today]
The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them, there are mostly plain text files but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files and 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and dealing with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. I guess it makes sense. First you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So, the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files. This worked much better, but I still noticed the whole unzipping was taking up a huge amount of time. Is there perhaps a way to optimize that?
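For context, the "dump to disk first" step looks roughly like this. It's a simplified sketch, not the production code; the upload_buffer name (a file-like object holding the uploaded zip) is made up for the illustration:

import os
import shutil
import tempfile
import zipfile

def save_and_extract(upload_buffer):
    # Spill the in-memory upload to a real file in /tmp, extract it
    # next to it, and do all the per-file work before the temporary
    # directory is cleaned up.
    with tempfile.TemporaryDirectory(dir='/tmp') as tmpdir:
        zip_path = os.path.join(tmpdir, 'upload.zip')
        with open(zip_path, 'wb') as f:
            shutil.copyfileobj(upload_buffer, f)
        dest = os.path.join(tmpdir, 'extracted')
        os.makedirs(dest)
        zipfile.ZipFile(zip_path).extractall(dest)
        ...  # walk 'dest' and deal with each file here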

Baseline function

First, here are the common functions that simulate actually doing something with the files in the zip file:

# Imports used by all the snippets below.
import concurrent.futures
import os
import zipfile


def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)


def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory
    # inefficient and unrealistic in terms of this
    # benchmark experiment.
    total = 0
    for line in f:
        total += len(line)
    return total

Here's the simplest one possible:

def f1(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            fn = os.path.join(root, file_)
            total += _count_file(fn)
    return total

If I analyze it a bit more carefully, I find that it spends about 40% of the time doing the extractall and 60% looping over the files and reading their full lengths.
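A rough way to see that split for yourself is to wrap the two phases in timers. This is just a sketch, not the exact instrumentation used:

import time

def f1_timed(fn, dest):
    # Same as f1, but with a timer around each of the two phases.
    t0 = time.perf_counter()
    with open(fn, 'rb') as f:
        zipfile.ZipFile(f).extractall(dest)
    t1 = time.perf_counter()

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            total += _count_file(os.path.join(root, file_))
    t2 = time.perf_counter()

    print('extractall: {:.1f}s, reading: {:.1f}s'.format(t1 - t0, t2 - t1))
    return total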

First attempt

My first attempt was to try to use threads. You create an instance of zipfile.ZipFile, extract every file name within and start a thread for each name. Each thread is given a function that does the "meat of the work" (in this benchmark, iterating over the file and getting its total size). In reality that function does a bunch of complicated S3, Redis and PostgreSQL stuff, but in my benchmark I just made it a function that figures out the total length of the file. The thread pool function:

def f2(fn, dest):

    def unzip_member(zf, member, dest):
        zf.extract(member, dest)
        fn = os.path.join(dest, member.filename)
        return _count_file(fn)

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member,
                        zf,
                        member,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~10% faster

Second attempt

So perhaps the GIL is blocking me. The natural inclination is to try to use multiprocessing to spread the work across multiple available CPUs. But doing so has the disadvantage that you can't pass around non-pickleable objects, so you have to send just the filename to each worker function:

def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)



def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Result: ~300% faster

That's cheating!

The problem with using a pool of processes is that it requires that the original .zip file exists on disk. So in my web server, to use this solution, I'd first have to save the in-memory zip file to disk, then invoke this function. I'm not sure what the cost of that is, but it's not likely to be cheap.

Well, it doesn't hurt to poke around. Perhaps it could be worth it if the extraction was significantly faster.

But remember! This optimization depends on using up as many CPUs as it possibly can. What if some of those other CPUs are needed for something else going on in gunicorn? Those other processes would have to patiently wait till there's a CPU available. Since there are other things going on in this server, I'm not sure I'm willing to let one process take over all the other CPUs.

Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between f1 and f2! With the concurrent.futures pool classes you can cap the number of workers (CPUs) they're allowed to use, but that doesn't feel great either. What if you get the number wrong in a virtual environment? Or if the number is too low and you don't benefit from spreading the workload, and now you're just paying overhead to move the work around?
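For what it's worth, the cap itself is a single argument; the hard part is picking the number. A sketch, where the "minus 2" is pure guesswork:

# Leave a couple of CPUs for gunicorn and whatever else runs on the box.
max_workers = max((os.cpu_count() or 1) - 2, 1)
with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
    ...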

I'm going to stick with zipfile.ZipFile(file_buffer).extractall(temp_dir). It's good enough for this.

Want to try your hand at it?

I did my benchmarking using a c5.4xlarge EC2 server. The files can be downloaded from:

wget https://www.peterbe.com/unzip-in-parallel/hack.unzip-in-parallel.py
wget https://www.peterbe.com/unzip-in-parallel/symbols-2017-11-27T14_15_30.zip

The .zip file there is 34MB which is relatively small compared to what's happening on the server.

The hack.unzip-in-parallel.py script is a hot mess. It contains a bunch of terrible hacks and ugly stuff, but hopefully it's a start.

Comments

Anonymous
Consider using bzip2, xz or something similar which can be both compressed and decompressed in parallel. See pbzip2 and pixz. (Zip does not lend itself to this, so pigz would not help you too much under most circumstances.)

Choose the right UNIX tool: subprocesses and pipes are the way to go here, and pipes can go across network boundaries as well - rely on the deep history of computing.
Peter Bengtsson
Zip files can be decompressed in parallel too. You first ask the zip file for its file contents (fast), then you start parallel processes (or threads or whatever) that each get a file name to extract.

My multiprocess example above is a proof of that.
Martin Bammer
You should not use ThreadPoolExecutor, because it has a relatively high overhead and is very slow, especially if you use submit. Use a different thread pool implementation or, if you want to stick with ThreadPoolExecutor, use map. A much faster solution is the good old multiprocessing.pool.ThreadPool. Or you could also try my fastthreadpool module (https://github.com/brmmm3/fastthreadpool). This module is faster than ThreadPool and much faster than ThreadPoolExecutor. It also has the advantage that you can use generator functions as workers, which is very useful in certain situations. For example code please look into benchmark.py. The doc directory contains some benchmark files which show the overhead difference between the 3 thread pool implementations. I'm also working on a fastprocesspool module but this is still not finished and buggy. First tests have shown that this module is also faster than multiprocessing.Pool and much faster than ProcessPoolExecutor.
Peter Bengtsson
What's the alternative to submit?
And multiprocessing.Pool might be marginally faster, but isn't the bulk of the computation still in the actual unzipping?
Martin Bammer
The bulk is indeed in the unzipping. But if you have an archive with many small files, the overhead of the pool can be 10% or more. And that's a lot just for handling a pool of threads. The alternative is to use map, where you have to prepare an iterable before calling map. Another alternative is to switch to a faster pool implementation.
The module zipfile is completely written in Python, which comes with a relatively big overhead at the Python level, which in turn means the GIL is held relatively long. The result is a low level of parallelisation. I'm currently writing an archiving tool which uses msgpack and zstd. Both libraries have a very thin Python layer and parallelisation with threads is very good. I get nearly 100% CPU load. The results currently are ~4x faster than zip, with a compression ratio between zip and lzma. When the tool is finished I'll release it to the public.
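For reference, here's roughly what the map alternative Martin mentions would look like applied to f3 above. It's only a sketch and hasn't been benchmarked here:

import itertools

def f3_map(fn, dest):
    with zipfile.ZipFile(fn) as zf:
        names = [member.filename for member in zf.infolist()]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map() takes one iterable per worker argument and
        # stops at the shortest one, so the filename list drives it.
        results = executor.map(
            unzip_member_f3,
            itertools.repeat(fn),
            names,
            itertools.repeat(dest),
        )
        return sum(results)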
Peter Bengtsson
If you have an example of using fastthreadpool that I can use instead of concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor then I'll try to run a benchmark with that too.
Masklinn
Peter, if you don't need the entire content of the file in memory (which I guess you don't since you're apparently fine with reading it from disk despite in-memory decompression blowing up) *and* you don't need seekability, you may have missed the `open` method on zipfiles. You'll still be paying for the decompression work, and as noted previously you can only access data sequentially (no seeking backwards) but it *is* lazy, you ask for a 10k chunk of a file, you get 10k in memory (plus some overhead from Python).
Peter Bengtsson
Doing an open requires all things to be in-memory. It's a bit faster but would require me to have a much beefier server, which might be the best course of action actually.
Masklinn
> Doing an open requires all things to be in-memory.

It only requires that the zip file be accessible, either in-memory or on-disk.

Your introduction says that you originally had the zip file *and* the decompressed files in memory, ZipFile.open avoids the latter. Furthermore the small bit of code you post in the conclusion seems to hint that you still have the zip file in memory when you do the extraction.

And again, you can use ZipFile.open just fine if you have the zip file on-disk, it's still lazy.
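A minimal sketch of what Masklinn is describing, using ZipFile.open so each member is streamed and counted without ever being written to disk. It reuses the names from the benchmark above; dest is kept only to match the other functions' signature:

def f4(fn, dest):
    # Nothing is extracted to disk; each member is read lazily.
    total = 0
    with zipfile.ZipFile(fn) as zf:
        for member in zf.infolist():
            with zf.open(member) as f:
                total += _count_file_object(f)
    return total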
jfs
It might be more efficient to count bytes by reading chunks instead of lines: `_count_file_object = lambda file: sum(map(len, iter(lambda: file.read(1<<15), b'')))` (assuming you don't want `os.path.getsize(filename)` or similar). Related: https://stackoverflow.com/questions/9371238/why-is-reading-lines-from-stdin-much-slower-in-c-than-python
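Spelled out, jfs's one-liner is equivalent to something like this:

def _count_file_object(f):
    # Read fixed-size 32kB chunks until EOF and sum their lengths,
    # instead of iterating line by line.
    total = 0
    while True:
        chunk = f.read(1 << 15)
        if not chunk:
            break
        total += len(chunk)
    return total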
