Fastest way to unzip a zip file in Python

Wednesday, Jan 31, 2018

Comment

Martin Bammer February 5, 2018

The bulk is indeed in the unzipping. But if you have an archive with many small files, the overhead of the pool can be 10% or more, which is a lot for just managing a pool of threads. One alternative is to use map, where you have to prepare an iterable before calling it. Another is to switch to a faster pool implementation.
The zipfile module is written entirely in Python, which adds relatively large overhead at the Python level, and that in turn means the GIL is held for relatively long stretches. The result is a low level of parallelisation. I'm currently writing an archiving tool that uses msgpack and zstd. Both libraries have a very thin Python layer, so parallelisation with threads works very well and I get nearly 100% CPU load. The results so far are ~4x faster than zip, with a compression ratio between that of zip and lzma. When the tool is finished I'll release it to the public.
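
To make the submit versus map distinction concrete, here is a minimal sketch using only the standard library. The helper names (extract_member, unzip_with_submit, unzip_with_map) and the default of 4 workers are made up for illustration and are not taken from the original post. Note that in CPython's concurrent.futures, Executor.map is itself built on submit, so any gain from switching may be small there and is worth measuring.

```python
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import repeat


def extract_member(zip_path, name, dest):
    # Hypothetical helper: each task opens its own ZipFile handle so that
    # threads do not share one underlying file pointer.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract(name, dest)
    return name


def unzip_with_submit(zip_path, dest, workers=4):
    # Create the target directory up front so worker threads do not race
    # to create it.
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One submit() call, and one Future object, per archive member.
        futures = [pool.submit(extract_member, zip_path, n, dest) for n in names]
        return sorted(f.result() for f in as_completed(futures))


def unzip_with_map(zip_path, dest, workers=4):
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() takes one iterable per positional argument; repeat() supplies
        # the constant arguments and iteration stops when names runs out.
        return sorted(pool.map(extract_member, repeat(zip_path), names, repeat(dest)))
```

Both functions return the sorted list of extracted member names, so they can be swapped in a benchmark without changing the surrounding code. Archives with nested directories may still hit directory-creation races on older Python versions, which this sketch does not handle.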

Parent comment

Peter Bengtsson February 4, 2018

What's the alternative to submit? And multiprocessing.Pool might be marginally faster, but isn't the bulk of the computation still in the actual unzipping?

Replies

Peter Bengtsson February 6, 2018

If you have an example of using fastthreadpool that I can use instead of concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor then I'll try to run a benchmark with that too.
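
A benchmark along those lines could be structured so that the pool class is a parameter. The sketch below is an assumption about how such a harness might look, not the code used in the post: time_unzip and _extract are hypothetical names, and it only works with classes that follow the concurrent.futures.Executor interface (a max_workers argument, context-manager support, and map). Whether fastthreadpool offers that interface is not assumed here.

```python
import os
import time
import zipfile
from concurrent.futures import ThreadPoolExecutor


def _extract(args):
    # Module-level function taking a single tuple, so it can also be pickled
    # and sent to worker processes by ProcessPoolExecutor.
    zip_path, name, dest = args
    with zipfile.ZipFile(zip_path) as zf:
        zf.extract(name, dest)
    return name


def time_unzip(executor_cls, zip_path, dest, workers=4):
    # Returns (elapsed_seconds, number_of_members_extracted).
    os.makedirs(dest, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    jobs = [(zip_path, name, dest) for name in names]
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        extracted = list(pool.map(_extract, jobs))
    # The timer stops after the pool has shut down, so pool start-up and
    # tear-down overhead are included in the measurement.
    return time.perf_counter() - start, len(extracted)
```

Calling time_unzip(ThreadPoolExecutor, "archive.zip", "out") and then repeating the call with ProcessPoolExecutor gives timings that include pool overhead, which is the part under discussion in this thread. The file name "archive.zip" is a placeholder.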