Fastest way to uniquify a list in Python >=3.6

23 December 2017   7 comments   Python

https://gist.github.com/peterbe/67b9e40af60a1d5bcb1cfb4b2937b088

This is an update to a old blog post from 2006 called Fastest way to uniquify a list in Python. But this, time for Python 3.6. Why, because Python 3.6 preserves the order when inserting keys to a dictionary . How, because the way dicts are implemented in 3.6, the way it does that is different and as an implementation detail the order gets preserved. Then, in Python 3.7, which isn't released at the time of writing, that order preserving is guaranteed.

Anyway, Raymond Hettinger just shared a neat little way to uniqify a list. I thought I'd update my old post from 2006 to add list(dict.fromkeys('abracadabra')).

Functions

Reminder, there are two ways to uniqify a list. Order preserving and not order preserving. For example, the unique letters in peter is p, e, t, r in their "original order". As opposed to t, e, p, r.

def f1(seq):  
# Raymond Hettinger
    hash_ = {}
    [hash_.__setitem__(x, 1) for x in seq]
    return hash_.keys()
def f3(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()
def f5(seq, idfun=None):  # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen:
            continue
        seen[marker] = 1
        result.append(item)
    return result
def f5b(seq, idfun=None):  # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker not in seen:
            seen[marker] = 1
            result.append(item)

    return result
def f7(seq):
    # Not order preserving
    return list(set(seq))
def f8(seq):  # Dave Kirby
    # Order preserving
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]
def f9(seq):
    # Not order preserving, even in Py >=3.6
    return {}.fromkeys(seq).keys()
def f10(seq, idfun=None):  # Andrew Dalke
    # Order preserving
    return list(_f10(seq, idfun))

def _f10(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)
            if x in seen:
                continue
            seen.add(x)
            yield x
def f11(seq):  # f10 but simpler
    # Order preserving
    return list(_f10(seq))
def f12(seq):
    # Raymond Hettinger
    # https://twitter.com/raymondh/status/944125570534621185
    return list(dict.fromkeys(seq))

Results

FUNCTION        ORDER PRESERVING     MEAN       MEDIAN
f12             yes                  111.0      112.2
f8              yes                  266.3      266.4
f10             yes                  304.0      299.1
f11             yes                  314.3      312.9
f5              yes                  486.8      479.7
f5b             yes                  494.7      498.0
f7              no                   95.8       95.1
f9              no                   100.8      100.9
f3              no                   143.7      142.2
f1              no                   406.4      408.4

Not order preserving

Order preserving

Conclusion

The fastest way to uniqify a list of hashable objects (basically immutable things) is:

list(set(seq))

And the fastest way, if the order is important is:

list(dict.fromkeys(seq))

Now we know.

Comments

Stefan Pochmann

What is this mysterious sequence "seq" that you're testing with? Can you please show the *whole* benchmark code?

Peter Bengtsson

See the link in the top to that gist.github.com

Stefan Pochmann

Thanks, missed that since it's "above the article". Other things:

- You say in Python 3.7 "order preserving is guaranteed" but following that link I only see Raymond saying "a guarantee for 3.7 is almost inevitable". Am I missing something again or are you overstating that?

- Better remove f2 from the list of functions to test, as it's suuuper slow.

- Would be good to include *min* as well, in addition to mean and median. As Veedrac argued at https://stackoverflow.com/a/24105845/1672429 - "all error in timing is positive [so] the shortest time has the least error in it". And *Raymond* uses min there as well (in his answer, the accepted one).

- I came up with four order-preserving solutions slower than Raymond's but much faster than the others. Then I also found them in the comments of the old blog post, ha. Here they are:

    def f13(seq):
        seen = set()
        add = seen.add
        return [add(x) or x for x in seq if x not in seen]
        return [add(x) or x for x in seq if not x in seen]
        return [x for x in seq if not (x in seen or add(x))]
        return [x for x in seq if x not in seen and not add(x)]

Stefan Pochmann

- f1, f3 and f9 aren't quite right. They don't return a list but a "dict_keys" view object, which isn't fair, as that takes only O(1) time.

- What makes you think f9 doesn't preserve order? I don't see why it wouldn't, and it did preserve order in some tests I just did, including your script (after fixing the "dict_keys" issue by wrapping "list(...)" around it).

- In the comments under Raymond's answer https://stackoverflow.com/a/8220943/1672429 he also very strongly suggested using *min*: "Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Michael Kennedy

Nice analysis. f7 is now order preserving because it's backed by dictionaries, which are now going to be order preserving.

Michael Kennedy

Sorry, make that will be... when 3.7 comes out. :)

Meiyer Goren

Since Python 3.7, insertion-order dict’s are part of the language, as per the “official verdict” by BDFL: https://mail.python.org/pipermail/python-dev/2017-December/151283.html.

Your email will never ever be published


Related posts

Previous:
CSS selector simplifier regular expression in JavaScript 20 December 2017
Next:
How to rotate a video on OSX with ffmpeg 03 January 2018
Related by Keyword:
Merge two arrays without duplicates in JavaScript 20 September 2018
Items function in JavaScript for looping over dictionaries like Python 23 February 2018
Be careful with using dict() to create a copy 09 September 2015
In Python you sort with a tuple 14 June 2013
Fastest way to uniqify a list in Python 14 August 2006
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! Peterbe.com has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
Ever wondered how much $87 Billion is? 04 November 2003
Guake, not Yakuake or Yeahconsole 23 January 2010