tl;dr; It's faster to list objects with the prefix set to the full key path than to use HEAD to find out if an object is in an S3 bucket.
I have a piece of code that opens up a user-uploaded .zip file and extracts its content. Then it uploads each file into an AWS S3 bucket if the file size is different or if the file didn't exist at all before.
It looks like this:
for filename, filesize, fileobj in extract(zip_file):
    size = _size_in_s3(bucket, filename)
    if size is None or size != filesize:
        upload_to_s3(bucket, filename, fileobj)
        print('Updated!' if size else 'New!')
    else:
        print('Ignored')
I'm using the boto3 S3 client so there are two ways to ask if the object exists and get its metadata.
Option 1: client.head_object
Option 2: client.list_objects_v2 with Prefix set to the full key
The problem with client.head_object is that it's odd in how it works. Sane but odd. If the object does not exist, boto3 raises a botocore.exceptions.ClientError which contains a response, and in it you can look for exception.response['Error']['Code'] == '404'.
What I noticed was that if you use a try:except ClientError: approach to figure out if an object exists, you reset the client's connection pool in urllib3. So after an exception has happened, any other operation on the client causes it to have to, internally, create a new HTTPS connection. That can cost time.
So I wrote two different functions to return an object's size if it exists:
from botocore.exceptions import ClientError

def _key_existing_size__head(client, bucket, key):
    """return the key's size if it exists, else None"""
    try:
        obj = client.head_object(Bucket=bucket, Key=key)
        return obj['ContentLength']
    except ClientError as exc:
        # A 404 means the object doesn't exist; anything else is a real error.
        if exc.response['Error']['Code'] != '404':
            raise
And the contender...:
def _key_existing_size__list(client, bucket, key):
    """return the key's size if it exists, else None"""
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=key,
    )
    # 'Contents' is missing when nothing matches, hence the empty-list default.
    for obj in response.get('Contents', []):
        if obj['Key'] == key:
            return obj['Size']
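To see why the exact `obj['Key'] == key` comparison and the empty-list default both matter, here is a small sketch that runs without AWS. `FakeS3Client` is a hypothetical stand-in written for this example, not part of boto3, and it only imitates the two behaviours relevant here: prefix matching, and leaving out `Contents` when nothing matches.

```python
class FakeS3Client:
    """Hypothetical stand-in for a boto3 S3 client; not part of boto3."""

    def __init__(self, objects):
        self.objects = objects  # maps key -> size in bytes

    def list_objects_v2(self, Bucket, Prefix):
        contents = [
            {"Key": key, "Size": size}
            for key, size in sorted(self.objects.items())
            if key.startswith(Prefix)
        ]
        response = {"KeyCount": len(contents)}
        if contents:
            # Assumption mirrored from S3: no "Contents" key when nothing matches.
            response["Contents"] = contents
        return response


def key_existing_size(client, bucket, key):
    """Return the key's size if it exists, else None."""
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    for obj in response.get("Contents", []):
        if obj["Key"] == key:
            return obj["Size"]
    return None


client = FakeS3Client({"report.txt.bak": 10, "report.txt": 7})
print(key_existing_size(client, "my-bucket", "report.txt"))   # exact key exists
print(key_existing_size(client, "my-bucket", "report"))       # only a prefix of other keys
print(key_existing_size(client, "my-bucket", "missing.txt"))  # nothing matches at all
```

Only the first lookup returns a size. The second one matches two objects by prefix but neither key is exactly "report", so it correctly comes back as None.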
They both work. That was easy to test. But which is faster?
Before we begin, which do you think is faster? client.head_object feels like it should be able to send an operation to S3 internally to do a key lookup directly. But S3 isn't a normal database.
Here's the script, partially cleaned up, but it should be easy to run.
So I wrote a loop that ran 1,000 times, and I made sure the bucket was empty so that on every one of the 1,000 iterations it sees that the file doesn't exist and it has to do a client.put_object.
Here are the results:
FUNCTION: _key_existing_size__list Used 511 times
    SUM    148.2740752696991
    MEAN   0.2901645308604679
    MEDIAN 0.2569708824157715
    STDEV  0.17742598775696436
FUNCTION: _key_existing_size__head Used 489 times
    SUM    249.79622673988342
    MEAN   0.510830729529414
    MEDIAN 0.4780092239379883
    STDEV  0.14352671121877011
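Numbers in this shape can be collected with a loop along these lines. This is a minimal sketch and not the actual benchmark script: it assumes each iteration picks one of the two functions at random (which would explain the uneven "Used" counts), and `fake_list` and `fake_head` are hypothetical stand-ins that just sleep instead of talking to S3.

```python
import random
import statistics
import time
from collections import defaultdict


def benchmark(functions, keys):
    """Time one randomly chosen function per key; return {name: [seconds, ...]}."""
    timings = defaultdict(list)
    for key in keys:
        func = random.choice(functions)
        t0 = time.perf_counter()
        func(key)
        timings[func.__name__].append(time.perf_counter() - t0)
    return timings


def report(timings):
    """Print the same summary fields as the results shown above."""
    for name, times in sorted(timings.items()):
        print("FUNCTION:", name, "Used", len(times), "times")
        print("    SUM   ", sum(times))
        print("    MEAN  ", statistics.mean(times))
        print("    MEDIAN", statistics.median(times))
        if len(times) > 1:  # stdev needs at least two data points
            print("    STDEV ", statistics.stdev(times))


# Hypothetical stand-ins for the two real lookups.
def fake_list(key):
    time.sleep(0.001)


def fake_head(key):
    time.sleep(0.002)


timings = benchmark([fake_list, fake_head], [f"key-{i}" for i in range(50)])
report(timings)
```

To measure the real thing, the stand-ins would be swapped for small wrappers that call the two lookup functions with a boto3 client and a bucket name.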
Because it's network bound, it's really important to avoid the 'MEAN' and instead look at the 'MEDIAN'. My home broadband can cause temporary spikes.
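A quick illustration of why the median holds up better here. The timings are made up for the example: nine steady requests and one spike of the kind a home connection can produce.

```python
import statistics

# Hypothetical timings in seconds: nine steady requests and one network spike.
timings = [0.25] * 9 + [3.0]

mean = statistics.mean(timings)
median = statistics.median(timings)
print(f"mean={mean:.3f} median={median:.3f}")
```

The single spike drags the mean to more than double the typical request time, while the median stays at the steady value.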
client.list_objects_v2 is faster. Going by the medians (0.26 versus 0.48 seconds), it's close to 90% faster than client.head_object.
But note! This was 1,000 iterations of A) "does the file already exist?" and B) "No? OK, upload it". So the times there include all the client.put_object calls.
So why did I measure both? I.e. the existence check plus the client.put_object? The reason is that the approach of using try:except ClientError: followed by a client.put_object causes boto3 to create a new HTTPS connection in its pool. Again, see the issue which demonstrates this in different words.
So, I simply ran the benchmark again. The first time, it uploaded all 1,000 uniquely named objects. So running it a second time, every time the answer is that the object exists, and its size hasn't changed, so it never triggers the client.put_object.
Here are the results this time:
FUNCTION: _key_existing_size__list Used 495 times
    SUM    54.60546112060547
    MEAN   0.11031406286991004
    MEDIAN 0.08583354949951172
    STDEV  0.06339202669609442
FUNCTION: _key_existing_size__head Used 505 times
    SUM    44.59347581863403
    MEAN   0.0883039125121466
    MEDIAN 0.07310152053833008
    STDEV  0.054452842190700346
In this case, using client.head_object is faster. By nearly 20%, but the median time is about 0.08 seconds either way! Even on a home broadband connection. In other words, I don't think that difference is significant.
The point of using client.list_objects_v2 instead of client.head_object was to avoid breaking the connection pool in urllib3 that boto3 manages somehow. Having to create a new HTTPS connection (and adding it to the pool) costs time, but what if we disregard that and compare the two functions "purely" on how long they take when the file does NOT exist? Remember, the second measurement above was when every object exists.
So we know it took 0.09 seconds and 0.07 seconds respectively for the two functions to figure out that the object does exist. How long does it take to figure out that the object does not exist, independent of any other operation? I.e. just try each one without doing a client.put_object afterwards. That means we avoid the bug, so the comparison is fair.
FUNCTION: _key_existing_size__list Used 499 times
    SUM    123.57429671287537
    MEAN   0.247643881188127
    MEDIAN 0.2196049690246582
    STDEV  0.18622877427652743
FUNCTION: _key_existing_size__head Used 501 times
    SUM    112.99495434761047
    MEAN   0.22553883103315464
    MEDIAN 0.2828958034515381
    STDEV  0.15342842113446084
Going by the medians, client.list_objects_v2 beats client.head_object by 30%. And it matters. Above I said that a 20% difference didn't matter, but now it does. That's because the time difference when it always finds the object was 0.013 seconds. When it comes to figuring out that the object did not exist, the time difference is 0.063 seconds. That's still a pretty small number but, hey, you gotta draw the line somewhere.
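The differences quoted above can be recomputed from the reported medians. A short sketch, using only the MEDIAN values from the second and third result blocks, with the percentage taken relative to the faster function in each case:

```python
# Median timings in seconds, copied from the benchmark results above.
medians = {
    "object exists": {"list": 0.08583354949951172, "head": 0.07310152053833008},
    "object missing": {"list": 0.2196049690246582, "head": 0.2828958034515381},
}

gaps = {}
for scenario, times in medians.items():
    faster = min(times, key=times.get)
    slower = max(times, key=times.get)
    gap = times[slower] - times[faster]
    gaps[scenario] = gap
    # Percentage is how much longer the slower function takes than the faster one.
    print(f"{scenario}: {faster} wins by {gap:.3f}s ({gap / times[faster]:.0%})")
```

This gives a gap of roughly 0.013 seconds when the object exists and roughly 0.063 seconds when it doesn't.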
client.list_objects_v2 is a better alternative to using client.head_object.
If you think you'll often find that the object doesn't exist and needs a client.put_object, then using client.list_objects_v2 is 90% faster. If you think you'll rarely need client.put_object (i.e. that most objects don't change), then client.list_objects_v2 performs almost the same.
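One way to put a number on "often" versus "rarely" is to combine the medians from the second and third runs into an expected lookup time for a given miss rate. This is a rough sketch under a strong assumption: it ignores the extra cost of the new HTTPS connection after a failed client.head_object, so it understates the advantage of client.list_objects_v2 whenever an upload follows.

```python
# Median timings in seconds from the second (hit) and third (miss) runs above.
LIST_HIT, LIST_MISS = 0.08583354949951172, 0.2196049690246582
HEAD_HIT, HEAD_MISS = 0.07310152053833008, 0.2828958034515381


def expected_time(hit, miss, miss_rate):
    """Average lookup time when `miss_rate` of the looked-up objects don't exist."""
    return (1 - miss_rate) * hit + miss_rate * miss


# Miss rate at which both functions cost the same on average.
hit_gap = LIST_HIT - HEAD_HIT      # head is faster when the object exists
miss_gap = HEAD_MISS - LIST_MISS   # list is faster when it doesn't
break_even = hit_gap / (hit_gap + miss_gap)
print(f"break-even miss rate: {break_even:.1%}")

for miss_rate in (0.0, 0.1, 0.5, 1.0):
    list_time = expected_time(LIST_HIT, LIST_MISS, miss_rate)
    head_time = expected_time(HEAD_HIT, HEAD_MISS, miss_rate)
    print(f"miss rate {miss_rate:.0%}: list {list_time:.3f}s, head {head_time:.3f}s")
```

With these particular numbers the two come out even when roughly one lookup in six is a miss. Above that, client.list_objects_v2 is ahead even before counting the connection pool penalty.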