24 June 2017 7 comments Python
Suppose that you have a function and you wonder, "Can I make this faster?" Well, you might already have thought that and you might already have a theory. Or two. Or three. Your theory might be sound and likely to be right, but before you go anywhere with it you need to benchmark it first. Here are some tips and scaffolding for doing Python function benchmark comparisons.
Internally, Python will warm up and it's likely that your function depends on other things such as databases or IO. So it's important that you don't test function1
first and then function2
immediately after because function2
might benefit from a warm up painfully paid for by function1
. So mix up the order of them or cycle through them enough that they all pay for or gain from warm ups.
Look at the median first. The mean (aka. average) is often tainted by spikes and these spikes of slow-down can be caused by your local Spotify client deciding to reindex itself or something some such. Sometimes those spikes matter. For example, garbage collection is inevitable and will have an effect that matters.
Run your functions many times. So many times that the whole benchmark takes a while. Like tens of seconds or more. Also, if you run it significantly long it's likely that all candidates gets punished by the same environmental effects such as garbage collection or CPU being reassinged to something else intensive on your computer.
Try to take your benchmark into different, and possibly more realistic environments. For example, don't rely on reading a file like /Users/peterbe/only/on/my/macbook
when, likely, the end destination for your code is an Ubuntu server in AWS. Write your code so that it's easy to copy and paste around, like into a vi/jed editor in an ssh session somewhere.
Sanity check each function before benchmarking them. No need for pytest
or anything fancy but just make sure that you test them in some basic way. But the assertion testing is likely to add to the total execution time so don't do it when running the functions.
Avoid "prints" inside the time measured code. A print()
is I/O and an "external resource" that can become very unfair to compare CPU bound performance.
Don't fear testing many different functions. If you have multiple ideas of doing a function differently, it's cheap to pile them on. But be careful how you "report" because if there are many different ways of doing something you might accidentally compare different fruit without noticing.
Make sure your functions take at least one parameter. I'm no Python core developer or C hacker but I know there are "murks" within a compiler and interpreter that might do what a regular memoizer might done. Also, the performance difference can be reversed on tiny inputs compared to really large ones.
Be humble with the fact that 0.01 milliseconds difference when doing 10,000 iterations is probably not worth writing a more complex and harder-to-debug function.
Let's demonstrate with an example:
# The functions to compare
import math
def f1(degrees):
return math.cos(degrees)
def f2(degrees):
e = 2.718281828459045
return (
(e**(degrees * 1j) + e**-(degrees * 1j)) / 2
).real
# Assertions
assert f1(100) == f2(100) == 0.862318872287684
assert f1(1) == f2(1) == 0.5403023058681398
# Reporting
import time
import random
import statistics
functions = f1, f2
times = {f.__name__: [] for f in functions}
for i in range(100000): # adjust accordingly so whole thing takes a few sec
func = random.choice(functions)
t0 = time.time()
func(i)
t1 = time.time()
times[func.__name__].append((t1 - t0) * 1000)
for name, numbers in times.items():
print('FUNCTION:', name, 'Used', len(numbers), 'times')
print('\tMEDIAN', statistics.median(numbers))
print('\tMEAN ', statistics.mean(numbers))
print('\tSTDEV ', statistics.stdev(numbers))
Let's break that down a bit.
The first area (# The functions to compare
) is all up to you. This silly example tries to peg Python's builtin math.cos
against your own arithmetic expression.
The second area (# Assertions
) is where you do some basic sanity checks/tests. This comes in handy to make sure if you keep modifying the functions more and more to try to squeeze out some extra juice.
The last area (# Reporting
) is the boilerplat'y area. You obviously have to change the line functions = f1, f2
to include all the named functions you have in the first area. And the number of iterations totally depends on how long the functions take to run. Here it's 100,000 times which is kinda ridiculous but I just needed a dead simple function to demonstrate.
Note that each measurement is in milliseconds.
You run that and get something like this:
FUNCTION: f1 Used 49990 times MEDIAN 0.0 MEAN 0.00045161219591330375 STDEV 0.0011268475946446341 FUNCTION: f2 Used 50010 times MEDIAN 0.00095367431640625 MEAN 0.0009188626294516487 STDEV 0.000642871632138125
The example above already broke one of the tenets in that these functions were simply too fast. Doing rather basic mathematics is just too fast to compare with such a trivial benchmark. Here are some other examples:
# The functions to compare
def f1(seq):
checked = []
for e in seq:
if e not in checked:
checked.append(e)
return checked
def f2(seq):
checked = []
seen = set()
for e in seq:
if e not in seen:
checked.append(e)
seen.add(e)
return checked
def f3(seq):
checked = []
[checked.append(i) for i in seq if not checked.count(i)]
return checked
def f4(seq):
seen = set()
return [x for x in seq if x not in seen and not seen.add(x)]
def f5(seq):
def generator():
seen = set()
for x in seq:
if x not in seen:
seen.add(x)
yield x
return list(generator())
# Assertion
import random
def _random_seq(length):
seq = []
for _ in range(length):
seq.append(random.choice(
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
))
return seq
L = list('abca')
assert f1(L) == f2(L) == f3(L) == f4(L) == f5(L) == list('abc')
L = _random_seq(10)
assert f1(L) == f2(L) == f3(L) == f4(L) == f5(L)
# Reporting
import time
import statistics
functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}
for i in range(3000):
seq = _random_seq(i)
for _ in range(len(functions)):
func = random.choice(functions)
t0 = time.time()
func(seq)
t1 = time.time()
times[func.__name__].append((t1 - t0) * 1000)
for name, numbers in times.items():
print('FUNCTION:', name, 'Used', len(numbers), 'times')
print('\tMEDIAN', statistics.median(numbers))
print('\tMEAN ', statistics.mean(numbers))
print('\tSTDEV ', statistics.stdev(numbers))
Results:
FUNCTION: f1 Used 3029 times MEDIAN 0.6871223449707031 MEAN 0.6917867380307822 STDEV 0.42611748137761174 FUNCTION: f2 Used 2912 times MEDIAN 0.054955482482910156 MEAN 0.05610262627130026 STDEV 0.03000829926668248 FUNCTION: f3 Used 2985 times MEDIAN 1.4472007751464844 MEAN 1.4371055654145566 STDEV 0.888658217522005 FUNCTION: f4 Used 2965 times MEDIAN 0.051975250244140625 MEAN 0.05343245816673035 STDEV 0.02957275548477728 FUNCTION: f5 Used 3109 times MEDIAN 0.05507469177246094 MEAN 0.05678296204202234 STDEV 0.031521596461048934
Winner:
def f4(seq):
seen = set()
return [x for x in seq if x not in seen and not seen.add(x)]
# The functions to compare
import codecs
import subprocess
def f1(filename):
count = 0
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
for line in f:
count += 1
return count
def f2(filename):
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
return len(f.read().splitlines())
def f3(filename):
return int(subprocess.check_output(['wc', '-l', filename]).split()[0])
# Assertion
filename = 'big.csv'
assert f1(filename) == f2(filename) == f3(filename) == 9999
# Reporting
import time
import statistics
import random
functions = f1, f2, f3
times = {f.__name__: [] for f in functions}
filenames = 'dummy.py', 'hacker_news_data.txt', 'yarn.lock', 'big.csv'
for _ in range(200):
for fn in filenames:
for func in functions:
t0 = time.time()
func(fn)
t1 = time.time()
times[func.__name__].append((t1 - t0) * 1000)
for name, numbers in times.items():
print('FUNCTION:', name, 'Used', len(numbers), 'times')
print('\tMEDIAN', statistics.median(numbers))
print('\tMEAN ', statistics.mean(numbers))
print('\tSTDEV ', statistics.stdev(numbers))
Results:
FUNCTION: f1 Used 800 times MEDIAN 5.852460861206055 MEAN 25.403797328472137 STDEV 37.09347378640582 FUNCTION: f2 Used 800 times MEDIAN 0.45299530029296875 MEAN 2.4077045917510986 STDEV 3.717931526478758 FUNCTION: f3 Used 800 times MEDIAN 2.8804540634155273 MEAN 3.4988239407539368 STDEV 1.3336427480808102
Winner:
def f2(filename):
with codecs.open(filename, encoding='utf-8', errors='ignore') as f:
return len(f.read().splitlines())
No conclusion really. Just wanted to point out that this is just a hint of a decent start when doing performance benchmarking of functions.
There is also the timeit built-in for "provides a simple way to time small bits of Python code" but it has the disadvantage that your functions are not allowed to be as complex. Also, it's harder to generate multiple different fixtures to feed your functions without that fixture generation effecting the times.
There's a lot of things that this boilerplate can improve such as sorting by winner, showing percentages comparisons against the fastests, ASCII graphs, memory allocation differences, etc. That's up to you.
Nice article! I would recommend using the perf module for micro benchmarks: https://perf.readthedocs.io/en/latest/examples.html
The author made great work trying to have stable benchmarks: https://haypo.github.io/journey-to-stable-benchmark-average.html
I'm so glad I blogged. So I would get a comment like that. Thanks!
Really useful, well written and comprehensive article. Thank you for you effort!
This is great, easy to understand and useful! Thanks!
This is great! Thank you. Can't find a decent package that wraps all this for an "easy to use" solution. There must be some out there, but can't find it.
Hey Peter,
Thanks for this blogpost!
Sadly, Haypo's repository on Github is not available anymore.
Looks like haypo is now vstinner, so the links above would be https://github.com/psf/pyperf/ and https://vstinner.github.io/journey-to-stable-benchmark-average.html