tl;dr: It's ciso8601.
I have this Python app that I'm working on. It has a cron job that downloads a listing of every single file in an S3 bucket. AWS S3 publishes a manifest that points to a bunch of .csv.gz files; you download the manifest and then download each hashhashash.csv.gz it lists. My program then reads these CSV files and ignores any row that falls beyond the retention period. It basically parses the ISO-formatted datetime string in each row, compares it with a cutoff datetime.datetime instance, and either skips the row quickly or lets it through for full processing.
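Roughly, that filtering step looks like this (a simplified sketch; the file name, the column index, the 30-day retention window and process() are just made up for illustration, and it assumes a ciso8601 version where parse_datetime returns a timezone-aware datetime for strings ending in Z, so it can be compared to an aware cutoff):

import csv
import datetime
import gzip

import ciso8601


RETENTION = datetime.timedelta(days=30)  # made-up retention window
cutoff = datetime.datetime.now(datetime.timezone.utc) - RETENTION


def process(row):
    ...  # the actual work (not shown here)


with gzip.open('hashhashash.csv.gz', 'rt') as f:
    for row in csv.reader(f):
        # parse_datetime returns an aware (UTC) datetime for the trailing 'Z'
        last_modified = ciso8601.parse_datetime(row[3])
        if last_modified < cutoff:
            continue  # beyond the retention period, skip it quickly
        process(row)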
At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total it's about 50 million rows of CSV. That means 50 million datetime parsings.
I admit, this cron job doesn't have to be super fast, and it's OK if it takes an hour since it's just a cron job running on a server in the cloud somewhere. But I would like to know: is there a way to speed up the datetime parsing? It feels expensive to do in Python 50 million times per day.
Here's the benchmark:
import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    # Standard library parsing with an explicit format string
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    # ciso8601's C implementation
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    # Manual slicing of the fixed-width string
    # (drops the fractional seconds and the trailing 'Z')
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Sanity check that all three functions agree (down to the minute)
assert f1(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f2(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f3(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == '201709211254'


functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
        func(row[3])
        t1 = time.perf_counter()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))
Yeah, it's a bit ugly but it works. Here's the output:
FUNCTION: f1 Used 111,475 times
    BEST   0.01300ms
    MEDIAN 0.01500ms
    MEAN   0.01685ms
    STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
    BEST   0.00100ms
    MEDIAN 0.00200ms
    MEAN   0.00197ms
    STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
    BEST   0.00300ms
    MEDIAN 0.00400ms
    MEAN   0.00409ms
    STDEV  0.00225ms
In summary:
f1: 0.01300 milliseconds
f2: 0.00100 milliseconds
f3: 0.00300 milliseconds
Or, if you compare to the slowest (f1):
f1: baseline
f2: 13 times faster
f3: about 4 times faster
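To put those per-call numbers in the context of ~50 million rows per day, here's a quick back-of-the-envelope using the mean times from the run above (rough numbers only; it assumes the per-call means hold at that scale):

ROWS = 50 * 1000 * 1000  # roughly 50 million datetime parsings per day

for name, mean_ms in [('f1', 0.01685), ('f2', 0.00197), ('f3', 0.00409)]:
    total_seconds = ROWS * mean_ms / 1000
    print('{}: ~{:.0f} seconds (~{:.1f} minutes)'.format(
        name, total_seconds, total_seconds / 60
    ))

So very roughly: about 14 minutes of pure parsing per day with f1 versus well under 2 minutes with f2.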
UPDATE
If you know with confidence that you don't want or need timezone-aware datetime instances, you can use ciso8601.parse_datetime_unaware instead.
From the README:
"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."
In my benchmark the strings I use look like this: 2017-09-21T12:54:24.000Z. I added another function to the benchmark that uses ciso8601.parse_datetime_unaware and it clocked in at the exact same time as f2.
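For completeness, that extra function is just a thin wrapper (I'm calling it f4 here; it assumes a ciso8601 version that still ships parse_datetime_unaware):

def f4(datestr):
    # Same as f2 but discards any time zone information (see the README quote above)
    return ciso8601.parse_datetime_unaware(datestr)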