Comment
The bulk is indeed in the unzipping. But if you have an archive with many small files, the overhead of the pool can be 10% or more, which is a lot for just managing a pool of threads. The alternative to submit is to use map, where you have to prepare an iterable before calling it. Another alternative is to switch to a faster pool implementation.
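To make the submit/map difference concrete, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor and zipfile on an in-memory archive. The helper names (make_archive, read_member, unzip_with_submit, unzip_with_map) are made up for illustration, and each task reopens the archive so the threads share no file handle. Note that in concurrent.futures, map is itself built on top of submit, so how much overhead map saves depends on the pool implementation you use.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def make_archive(n_files=50):
    """Build a small in-memory zip with many tiny members (test data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(n_files):
            zf.writestr(f"file_{i}.txt", f"content {i}\n" * 10)
    return buf.getvalue()


def read_member(data, name):
    """Decompress one member; the archive is reopened per call for thread safety."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return name, len(zf.read(name))


def unzip_with_submit(data, names, workers=4):
    """One submit() call and one Future object per archive member."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_member, data, n) for n in names]
        return dict(f.result() for f in futures)


def unzip_with_map(data, names, workers=4):
    """The iterable of names is prepared up front and handed to map() once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda n: read_member(data, n), names))
```

Both functions return the same mapping of member name to uncompressed size, so you can time them against each other on your own archive.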
The zipfile module is written entirely in Python, which adds a relatively large overhead at the Python level, which in turn means the GIL is held for relatively long stretches. The result is a low level of parallelisation. I'm currently writing an archiving tool that uses msgpack and zstd. Both libraries have a very thin Python layer, so parallelisation with threads works very well: I get nearly 100% CPU load. Currently the results are ~4x faster than zip, with a compression ratio between zip and lzma. When the tool is finished I'll release it to the public.
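The tool isn't released yet, so the following is not its code. As a stand-in for the "thin Python layer" idea, this sketch compresses independent chunks with the standard library's zlib in a thread pool. It assumes zlib releases the GIL while compressing, as CPython's implementation does. The chunk size and the function names are arbitrary choices for illustration, and msgpack and zstd are not used here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary value for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the payload into independent, fixed-size pieces."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_parallel(data, workers=4, level=6):
    """Compress each chunk in its own task; the heavy work runs in C code."""
    chunks = split_chunks(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: zlib.compress(c, level), chunks))


def decompress_parallel(blobs, workers=4):
    """Decompress the chunks in parallel and join them in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blobs))
```

Compressing chunks independently costs some compression ratio compared with one continuous stream, which is the usual trade-off for this kind of parallelism.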
Parent comment
What's the alternative to submit? And multiprocessing.Pool might be marginally faster, but isn't the bulk of the computation still in the actual unzipping?
Replies
If you have an example of using fastthreadpool that I can use instead of concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor then I'll try to run a benchmark with that too.