Fastest way to download a file from S3

29 March 2017   3 comments   Python

Powered by Fusion×

tl;dr; You can download files from S3 with requests.get() (whole or in stream) or use the boto3 library. Although slight differences in speed, the network I/O dictates more than the relative implementation of how you do it.

I'm working on an application that needs to download relatively large objects from S3. Some files are gzipped and size hovers around 1MB to 20MB (compressed).

So what's the fastest way to download them? In chunks, all in one go or with the boto3 library? I should warn, if the object we're downloading is not publically exposed I actually don't even know how to download other than using the boto3 library. In this experiment I'm only concerned with publicly available objects.

The Functions

f1()

The simplest first. Note that in a real application you would do something more with the r.content and not just return its size. And in fact you might want to get the text out instead since that's encoded.

def f1(url):
    r = requests.get(url)
    return len(r.content)

f2()

If you stream it you can minimize memory bloat in your application since you can re-use the chunks of memory if you're able to do something with the buffered content. In this case, the buffer is just piled on in memory, 512 bytes at a time.

def f2(url):
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=512):
        if chunk:
            buffer.write(chunk)
    return len(buffer.getvalue())

I did put a counter into that for-loop to see how many times it writes and if you multiple that with 512 or 1024 respectively it does add up.

f3()

Same as f2() but with twice as large chunks/

def f3(url):  # same as f2 but bigger chunk size
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            buffer.write(chunk)
    return len(buffer.getvalue())

f4()

I'm actually quite new to boto3 (the cool thing was to use boto before) and from some StackOverflow-surfing I found this solution to support downloading of gzipped or non-gzipped objects into a buffer:

def f4(url):
    _, bucket_name, key = urlparse(url).path.split('/', 2)
    obj = s3.Object(
        bucket_name=bucket_name,
        key=key
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    try:
        got_text = GzipFile(None, 'rb', fileobj=buffer).read()
    except OSError:
        buffer.seek(0)
        got_text = buffer.read()
    return len(got_text)

Note how it doesn't try to find out if the buffer is gzipped but instead relying on assuming it is plus a raised exception.
This feels clunky, around the "gunzipping", but it's probably quite representative of a final solution.

Complete experiment code here

The Results

At first I ran this on my laptop here on my decent home broadband whilst having lunch. The results were very similar to what I later found on EC2 but 7-10 times slower here. So let's focus on the results from within an EC2 node in us-west-1c.

The raw numbers are as follows (showing median values):

Function 18MB file Std Dev 1MB file Std Dev
f1 1.053s 0.492s 0.395s 0.104s
f2 1.742s 0.314s 0.398s 0.064s
f3 1.393s 0.727s 0.388s 0.08s
f4 1.135s 0.09s 0.264s 0.079s

I ran each function 20 times. It's interesting, but not totally surprising that the function that was fastest for the large file wasn't necessarily the fastest for the smaller file.

The winners are f1() and f4() both with one gold and one silver each. Makes sense because it's often faster to do big things, over the network, all at once.

Or, are there winners at all?

With a tiny margin, f1() and f4() are slightly faster but they are not as convenient because they're not streams. In f2() and f3() you have the ability to do something constructive with the stream. As a matter of fact, in my application I want to download the S3 object and parse it line by line so I can use response.iter_lines() which makes this super convenient.

But most importantly, I think we can conclude that it doesn't matter much how you do it. Network I/O is still king.

Lastly, that boto3 solution has the advantage that with credentials set right it can download objects from a private S3 bucket.

Bonus Thought!

This experiment was conducted on a m3.xlarge in us-west-1c. That 18MB file is a compressed file that, when unpacked, is 81MB. This little Python code basically managed to download 81MB in about 1 second. Yay!! The future is here and it's awesome.

Comments

Kevin J Buchs
Sure looks like you should have picked f1 and f4 as your winners.
Peter Bengtsson
Yes. Grrr. Typo. Will fix when I'm on a computer.
raj
Nice, I am exactly looking for this kind of analysis
Thank you for posting a comment

Your email will never ever be published


Related posts

Previous:
More CSS load performance experiments (of a React app) 25 March 2017
Next:
Fastest cache backend possible for Django 07 April 2017
Related:
Best practice with retries with requests 19 April 2017
Cope with JSONDecodeError in requests.get().json() in Python 2 and 3 16 November 2016
How to track Google Analytics pageviews on non-web requests (with Python) 03 May 2016
Crash-stats just became a whole lot faster 25 August 2015
How to test if gzip is working on your site or app 20 August 2015
Bye bye AWS EC2, Hello Digital Ocean 03 November 2014
How I back up all my photos on S3 via Dropbox 28 August 2014
Gzip rules the world of optimization, often 09 August 2014
All my apps are now running on one EC2 server 03 November 2013
HTML whitespace "compression" - don't bother! 11 March 2013