29 March 2017 2 comments Python
tl;dr; You can download files from S3 with requests.get() (whole or as a stream) or with the boto3 library. There are slight differences in speed, but network I/O matters more than which implementation you pick.
I'm working on an application that needs to download relatively large objects from S3. Some files are gzipped, and sizes range from about 1MB to 20MB (compressed).
So what's the fastest way to download them? In chunks, all in one go, or with the boto3 library? I should warn that if the object isn't publicly exposed, I don't actually know how to download it other than with the boto3 library. In this experiment I'm only concerned with publicly available objects.
The simplest first. Note that in a real application you would do something more with r.content than just return its size. In fact, you might want r.text instead, since that's decoded to a string.
import requests

def f1(url):
    # Download the whole body in one go
    r = requests.get(url)
    return len(r.content)
If you stream it you can minimize memory bloat in your application since you can re-use the chunks of memory if you're able to do something with the buffered content. In this case, the buffer is just piled on in memory, 512 bytes at a time.
import io

def f2(url):
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=512):
        if chunk:  # skip empty chunks
            buffer.write(chunk)
    return len(buffer.getvalue())
I did put a counter into that for-loop to see how many times it writes, and if you multiply that by 512 or 1024 respectively, it does add up.
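The counting check can be sketched without touching the network. This is a minimal illustration, not the code used in the experiment: `buffer_chunks` and `fake_chunks` are hypothetical helpers, and `fake_chunks` only stands in for `r.iter_content(chunk_size=...)` by slicing an in-memory payload.

```python
import io


def buffer_chunks(chunks):
    """Pile byte chunks into a BytesIO and count the writes."""
    buffer = io.BytesIO()
    writes = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks, as f2() does
            buffer.write(chunk)
            writes += 1
    return writes, len(buffer.getvalue())


def fake_chunks(payload, chunk_size):
    """Hypothetical stand-in for r.iter_content(chunk_size=chunk_size)."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


payload = b"x" * 10_000
writes, size = buffer_chunks(fake_chunks(payload, 512))
# Every chunk is full except possibly the last one, so the
# total size sits between (writes - 1) and writes full chunks.
assert (writes - 1) * 512 < size <= writes * 512
```

With a real response you would pass `r.iter_content(chunk_size=512)` in place of `fake_chunks(...)`; the last chunk is usually shorter than the rest, which is why the numbers add up only approximately.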
f3() is the same as f2() but with chunks twice as large:
def f3(url):
    # same as f2 but bigger chunk size
    r = requests.get(url, stream=True)
    buffer = io.BytesIO()
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            buffer.write(chunk)
    return len(buffer.getvalue())
I'm actually quite new to
boto3 (the cool thing was to use
boto before) and from some StackOverflow-surfing I found this solution to support downloading of gzipped or non-gzipped objects into a buffer:
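import io
from gzip import GzipFile
from urllib.parse import urlparse

import boto3

# Assumption: s3 is a boto3 resource (the original snippet doesn't show it)
s3 = boto3.resource('s3')

def f4(url):
    # Expects a path-style URL: /<bucket>/<key>
    _, bucket_name, key = urlparse(url).path.split('/', 2)
    obj = s3.Object(bucket_name=bucket_name, key=key)
    buffer = io.BytesIO(obj.get()["Body"].read())
    try:
        # Assume the object is gzipped...
        got_text = GzipFile(None, 'rb', fileobj=buffer).read()
    except OSError:
        # ...and fall back to the raw bytes if it isn't
        buffer.seek(0)
        got_text = buffer.read()
    return len(got_text)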
Note how it doesn't try to find out whether the buffer is gzipped. Instead it assumes it is and relies on a raised exception if it isn't.
This feels clunky, around the "gunzipping", but it's probably quite representative of a final solution.
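One possibly less clunky alternative is to peek at the first two bytes, since gzip data starts with the magic number 0x1f 0x8b. This is only a sketch and not what f4() does: `maybe_gunzip` is a hypothetical helper, and a non-gzip payload that happens to start with those two bytes would still raise an error.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def maybe_gunzip(payload):
    """Return decompressed bytes if payload looks gzipped, else payload as-is."""
    if payload[:2] == GZIP_MAGIC:
        return gzip.decompress(payload)
    return payload
```

In f4() this could replace the try/except by calling `maybe_gunzip(obj.get()["Body"].read())`.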
At first I ran this on my laptop here on my decent home broadband whilst having lunch. The results were very similar to what I later found on EC2 but 7-10 times slower here. So let's focus on the results from within an EC2 node in us-west-1c.
The raw numbers are as follows (showing median values):
| Function | 18MB file | Std Dev | 1MB file | Std Dev |
I ran each function 20 times. It's interesting, but not totally surprising, that the function that was fastest for the large file wasn't necessarily the fastest for the smaller file.
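For reference, a measurement loop along these lines could look like the following. It is a minimal sketch rather than the harness actually used for the numbers here: `benchmark` is a hypothetical helper, and using `time.perf_counter` and the `statistics` module is my assumption.

```python
import statistics
import time


def benchmark(func, *args, runs=20):
    """Call func(*args) `runs` times (runs >= 2); return (median, stdev) in seconds."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        func(*args)
        t1 = time.perf_counter()
        timings.append(t1 - t0)
    return statistics.median(timings), statistics.stdev(timings)
```

You would call it as `median, stdev = benchmark(f1, url)` for each function and each file.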
The winners are f1() and f4(), with one gold and one silver each. That makes sense because it's often faster to do big things over the network all at once.
By a tiny margin, f1() and f4() are faster, but they are not as convenient because they're not streams. In f2() and f3() you have the ability to do something constructive with the stream. As a matter of fact, in my application I want to download the S3 object and parse it line by line, so I can use response.iter_lines(), which makes this super convenient.
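One caveat for gzipped objects: if the object is served as raw gzip bytes (that is, without a Content-Encoding header that requests would decode for you), the streamed chunks are still compressed. Under that assumption, a line-by-line parser over the chunks could be sketched like this. `iter_gunzipped_lines` is a hypothetical helper, and it accepts any iterable of byte chunks, such as `r.iter_content(chunk_size=1024)`.

```python
import gzip
import zlib


def iter_gunzipped_lines(chunks):
    """Yield decompressed lines (bytes, newline stripped) from gzipped chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    for chunk in chunks:
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the rest
        *lines, pending = pending.split(b"\n")
        yield from lines
    tail = (pending + decompressor.flush()).split(b"\n")
    if tail[-1] == b"":
        tail.pop()  # no partial line left over
    yield from tail


# Simulate a chunked download of a small gzipped payload
compressed = gzip.compress(b"first\nsecond\nthird\n")
chunks = [compressed[i:i + 7] for i in range(0, len(compressed), 7)]
for line in iter_gunzipped_lines(chunks):
    print(line)
```

Only one chunk plus any incomplete line is held in memory at a time, which is the point of streaming in the first place.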
But most importantly, I think we can conclude that it doesn't matter much how you do it. Network I/O is still king.
The boto3 solution has the advantage that, with credentials set up correctly, it can also download objects from a private S3 bucket.
This experiment was conducted on an m3.xlarge in us-west-1c. That 18MB file is a compressed file that, when unpacked, is 81MB. This little Python code basically managed to download 81MB in about 1 second. Yay!! The future is here and it's awesome.