Fastest way to thousands-commafy large numbers in Python/PyPy

13 October 2012   15 comments   Python

Powered by Fusion×

Here are two perfectly good ways to turn 123456789 into "123,456,789":

import locale

def f1(n):
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    return locale.format('%d', n, True)

def f2(n):
    r = []
    for i, c in enumerate(reversed(str(n))):
        if i and (not (i % 3)):
            r.insert(0, ',')
        r.insert(0, c)
    return ''.join(r)

assert f1(123456789) == '123,456,789'
assert f2(123456789) == '123,456,789'    

Which one do you think is the fastest?

Easy, write a benchmark:

from time import time

for f in (f1, f2):
    t0 = time()
    for i in range(1000000):
        f(i)
    t1 = time()
    print f.func_name, t1 - t0, 'seconds'

And, drumroll, the results are:

peterbe@mpb:~$ python benchmark.py
f1 19.4571149349 seconds
f2 6.30253100395 seconds

The f2 one looks very plain and a good candidate for PyPy:

peterbe@mpb:~$ pypy dummy.py
f1 14.367814064 seconds
f2 0.77246594429 seconds

...which is 800% speed boost which is cute. It's also kinda ridiculous that each iteration of f2 takes 0.0000008 seconds. What's that!?

An obvious albeit somewhat risky optimization on f1 is this:

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
def f1(n):
    return locale.format('%d', n, True)

...and now we get:

peterbe@mpb:~$ python dummy.py
f1 16.3811080456 seconds
f2 6.14097189903 seconds

Before you say it, yes I'm aware the locale can do much more but I was just curious and I scratched it.

UPDATE

Dave points out the built in function format (which was added in Python 2.6). So let's add it and kick ass!

def f3(i):
    return format(i, ',')

And we run the tests again:

peterbe@mpb:~$ python dummy.py
f1 16.4227910042
f2 6.13625884056
f3 0.892002105713
peterbe@mpb:~$ pypy dummy.py
f1 4.61941003799
f2 0.720993041992
f3 0.26224398613

There's your winner!

Comments

Adam Forsyth
If we're talking about performance, I wouldn't call f2 a good method for this. While it's fine for small numbers, list.insert is O(n) for each insert, making f2 an O(n^2) operation, while the operation can easily be done in O(n) time since it requires only one pass over the data.

Two simple ways to rewrite f2 for O(n) operation are:

1. Use len() to calculate the offset from the left of the first comma so you can work left to right and use list.append instead of list.insert.

2. Use a collections.deque instead of a list to allow O(1) inserts at the beginning of the deque.

There are also a bunch of good solutions to this problem in http://stackoverflow.com/questions/1823058/how-to-print-number-with-commas-as-thousands-separators-in-python-2-x inlcuding the beautifully simple "'{:,}'.format(value)". I'd be interested to know how various methods perform on small vs. large numbers.
Dave
What about format(n, ',')?
Peter Bengtsson
It's new in 2.7 right? I guess it warrents to be included.
Peter Bengtsson
I actually didn't know about this one. I'm glad you mentioned it!
Senyai
+20% speed

    def f1(n):
        return ''.join(
            reversed(
                [
                c + ','
                if i != 0 and i % 3 == 0 else
                c
                for i, c in enumerate(reversed(str(n)))
                ]
            )
        )
Peter Bengtsson
That looks like f2() but laid out slightly differently.
Adam Skutt
There's nothing risky about that "optimization", as it is the right thing to do. The locale settings manged by locale.setlocale() are global. You shouldn't ever be setting them in a locale-using routine like that.
Peter Bengtsson
Bruno ReniƩ adds his version which improves on f2() by almost 100% in the CPython benchmark.
https://gist.github.com/7032f37e306bf1dd363a
Robert Helmer
I would've used the 2.7 format() for socorro-crashstats but we only have 2.6 on the Mozilla servers (RHEL 6) :( although honestly I care more about readability than perf unless it's shown to be a bottleneck for a use case we care about. However, this is still a very interesting discussion and I don't wish to discourage it :)

I've actually been playing with pypy a bit, even though we're super i/o-bound on Socorro. I think it's be interesting for certain type of analysis that Java is used for now, and also for better alternatives to our approach to threading (e.g. something safer/saner like STM) without having to write in some brutally different style like twisted/tornado/etc

Looking at local.py there's a lot going on there:
http://hg.python.org/cpython/file/5fc6f47974db/Lib/locale.py

This is related to the obvious downside to optimizing this - you are precluding localizing the app so it displays the appropriate digit group separator (note that "thousands" is not always the appropriate group, see http://en.wikipedia.org/wiki/Decimal_mark#Digit_grouping)
Peter Bengtsson
Performance would only matter if we did a spreadsheet app or something. We don't.

Right, in Swedish for example the price of a chewing gum is: SEK 0,5
Whereas here in the US it's USD 0.5
Neil Rashbrook
locale.setlocale(locale.LC_ALL, '')
def f5(n):
    return format('n', n)

Works in 2.6 (',' needs 2.7).
Peter Bengtsson
See the UPDATE to the post above.
Neil Rashbrook
Sorry, but I don't see any reference to the 'n' format in the post, its update, or any of the other comments.
Peter Bengtsson
Python 2.6.6 (r266:84292, Dec 5 2011, 09:38:23)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> format('n', 100000000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: format expects arg 2 to be string or unicode, not int
Neil Rashbrook
My apologies, I was working on two different computers and retyped the code snippet incorrectly. The correct code should of course be:

import locale
locale.setlocale(locale.LC_ALL, '')
def f5(n):
    return format(n, 'n')
Thank you for posting a comment

Your email will never ever be published


Related posts

Previous:
hastebinit - quickly paste snippets into hastebin.com 11 October 2012
Next:
Introducing: HUGEpic - a web app for showing massive pictures 03 November 2012
Related by keywords:
Fastest way to uniqify a list in Python 14 August 2006
mincss "Clears the junk out of your CSS" 21 January 2013
Gzip rules the world of optimization, often 09 August 2014
Optimization of getting random rows out of a PostgreSQL in Django 23 February 2011
mincss in action - sample report from the wild 22 January 2013
Optimizing MozTrap 04 June 2014
mincss version 0.8 is much much faster 27 February 2013
Python optimization anecdote 11 February 2005
DoneCal homepage now able to do 10,000 requests/second 13 February 2011
More optimization of Peterbe.com - CSS sprites 05 August 2009
Optimization story involving something silly I call "dict+" 13 June 2011
rfc822() vs. rfc1123_date() 16 August 2007