Fastest Python datetime parser

02 May 2018   0 comments   Python

https://github.com/closeio/ciso8601

tl;dr; It's ciso8601.

I have this Python app that I'm working on. It has a cron job that downloads a listing of every single file listed in an S3 bucket. AWS S3 publishes a manifest of .csv.gz files. You download the manifest and for each hashhashash.csv.gz you download those files. Then my program reads these CSV files and it is able to ignore certain rows based on them being beyond the retention period. It basically parses the ISO formatted datetime as a string, compares it with a cutoff datetime.datetime instance and is able to quickly skip or allow it in for full processing.

At the time of writing, it's roughly 160 .csv.gz files weighing a total of about 2GB. In total it's about 50 millions rows of CSV. That means it's 50 million datetime parsings.

I admit, this cron job doesn't have to be super fast and it's OK if it takes an hour since it's just a cron job running on a server in the cloud somewhere. But I would like to know, is there a way to speed up the date parsing because that feels expensive to do in Python 50 million times per day.

Here's the benchmark:

import csv
import datetime
import random
import statistics
import time

import ciso8601


def f1(datestr):
    return datetime.datetime.strptime(datestr, '%Y-%m-%dT%H:%M:%S.%fZ')


def f2(datestr):
    return ciso8601.parse_datetime(datestr)


def f3(datestr):
    return datetime.datetime(
        int(datestr[:4]),
        int(datestr[5:7]),
        int(datestr[8:10]),
        int(datestr[11:13]),
        int(datestr[14:16]),
        int(datestr[17:19]),
    )


# Assertions
assert f1(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f2(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == f3(
    '2017-09-21T12:54:24.000Z'
).strftime('%Y%m%d%H%M') == '201709211254'


functions = f1, f2, f3
times = {f.__name__: [] for f in functions}

with open('046444ae07279c115edfc23ba1cd8a19.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        func = random.choice(functions)
        t0 = time.clock()
        func(row[3])
        t1 = time.clock()
        times[func.__name__].append((t1 - t0) * 1000)


def ms(number):
    return '{:.5f}ms'.format(number)


for name, numbers in times.items():
    print('FUNCTION:', name, 'Used', format(len(numbers), ','), 'times')
    print('\tBEST  ', ms(min(numbers)))
    print('\tMEDIAN', ms(statistics.median(numbers)))
    print('\tMEAN  ', ms(statistics.mean(numbers)))
    print('\tSTDEV ', ms(statistics.stdev(numbers)))

Yeah, it's a bit ugly but it works. Here's the output:

FUNCTION: f1 Used 111,475 times
    BEST   0.01300ms
    MEDIAN 0.01500ms
    MEAN   0.01685ms
    STDEV  0.00706ms
FUNCTION: f2 Used 111,764 times
    BEST   0.00100ms
    MEDIAN 0.00200ms
    MEAN   0.00197ms
    STDEV  0.00167ms
FUNCTION: f3 Used 111,362 times
    BEST   0.00300ms
    MEDIAN 0.00400ms
    MEAN   0.00409ms
    STDEV  0.00225ms

In summary:

Or, if you compare to the slowest (f1):

UPDATE

If you know with confidence that you don't want or need timezone aware datetime instances, you can use csiso8601.parse_datetime_unaware instead.

from the README:

"Please note that it takes more time to parse aware datetimes, especially if they're not in UTC. If you don't care about time zone information, use the parse_datetime_unaware method, which will discard any time zone information and is faster."

In my benchmark the strings I use look like this 2017-09-21T12:54:24.000Z. I added another function to the benmark that uses ciso8601.parse_datetime_unaware and it clocked in at the exact same time as f2.

Comments

Your email will never ever be published


Related posts

Previous:
The impressive first-meaningful-paint improvement of using minimalcss 24 April 2018
Next:
gtop is best 02 May 2018
Related by Keyword:
Rust > Go > Python ...to parse millions of dates in CSV files 15 May 2018
Django forms and making datetime inputs localized 04 December 2015
rfc822() vs. rfc1123_date() 16 August 2007
Date formatting in Python or in PostgreSQL (part II) 19 April 2006
Date formatting in python or in PostgreSQL 20 July 2004
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! Peterbe.com has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
Ever wondered how much $87 Billion is? 04 November 2003
Guake, not Yakuake or Yeahconsole 23 January 2010