Fastest way to uniqify a list in Python

Monday, Aug 14, 2006
92 comments Python

SEE UPDATE BELOW

Suppose you have a list in python that looks like this:


['a','b','a']
# or like this:
[1,2,2,2,3,4,5,6,6,6,6]

and you want to remove all duplicates so you get this result:


['a','b']
# or
[1,2,3,4,5,6]

How do you do that? ...the fastest way? I wrote a couple of alternative implementations and did a quick benchmark loop on the various implementations to find out which way was the fastest. (I haven't looked at memory usage). The slowest function was 78 times slower than the fastest function.

However, there's one very important difference between the various functions. Some are order preserving and some are not. For example, in an order preserving function, apart from the duplicates, the order is guaranteed to be the same as it was inputted. Eg, uniqify([1,2,2,3])==[1,2,3]

Here are the functions:


def f1(seq):
   # not order preserving
   set = {}
   map(set.__setitem__, seq, [])
   return set.keys()

def f2(seq): 
   # order preserving
   checked = []
   for e in seq:
       if e not in checked:
           checked.append(e)
   return checked

def f3(seq):
   # Not order preserving
   keys = {}
   for e in seq:
       keys[e] = 1
   return keys.keys()

def f4(seq): 
   # order preserving
   noDupes = []
   [noDupes.append(i) for i in seq if not noDupes.count(i)]
   return noDupes

def f5(seq, idfun=None): 
   # order preserving
   if idfun is None:
       def idfun(x): return x
   seen = {}
   result = []
   for item in seq:
       marker = idfun(item)
       # in old Python versions:
       # if seen.has_key(marker)
       # but in new ones:
       if marker in seen: continue
       seen[marker] = 1
       result.append(item)
   return result

def f6(seq):
   # Not order preserving    
   set = Set(seq)
   return list(set)

And what you've all been waiting for (if you're still reading). Here are the results:


* f2 13.24
* f4 11.73
* f5 0.37
f1 0.18
f3 0.17
f6 0.19

(* order preserving)

Clearly f5 is the "best" solution. Not only is it really really fast; it's also order preserving and supports an optional transform function which makes it possible to do this:


>>> a=list('ABeeE')
>>> f5(a)
['A','B','e','E']
>>> f5(a, lambda x: x.lower())
['A','B','e']

Download the benchmark script here

UPDATE

From the comments I've now added a couple of more functions to the benchmark. Some which don't support uniqify a list of objects that can't be hashed unless passed with a special hashing method. So see all the functions download the file

Here are the new results:


* f5 10.1
* f5b 9.99
* f8 6.49
* f10 6.57
* f11 6.6
f1 4.28
f3 3.55
f6 4.03
f7 2.59
f9 2.58

(f2 and f4) were too slow for this testdata.

UPDATE Dec 2017

Fastest way to uniquify a list in Python >=3.6

Comments

Post your own comment

Lawrence Oluyede August 14, 2006

Keep in mind you can also use:

>>> lst = [1, 1, 3, 4, 4, 5, 6, 7, 6]
>>> set(lst)
set([1, 3, 4, 5, 6, 7])

Paul Boddie August 14, 2006

Isn't that what f6 does, apart from the final conversion to a list again?

Lawrence Oluyede August 14, 2006

Right. I totally missed f6()

none August 14, 2006

It would have been interesting to tests against more complex objects which redefines __cmp__.

- Sylvain

PS: Your comment system seems to be broken as I currently have the name and email fields filled up with Lawrence details.

Peter Bengtsson August 14, 2006

Thanks for the bug report. I'm working on fixing it. It's due to the fact that I've started caching the pages on a proxying server. Thanks.

Peter Bengtsson August 14, 2006

Should be fixed now.

M Lauer August 14, 2006

problem with the previous comment:
>>> lst = ['a','a','b',1,1,2,3,4,5]
>>> set(lst)
set(['a', 1, 2, 'b', 4, 5, 3])

Justin August 14, 2006

looks like you are using Set from python2.3 and not set from python2.4

djo August 14, 2006

Peter - worth reading this:

http://www.joelonsoftware.com/items/2006/08/01.html

Short version: map() can be trivially parallelized, for() can't. Functional-tasting solutions are going to run faster in multi-processor environments. Worth considering.

Martijn Faassen August 15, 2006

So how would you parallelize uniqify, trivially or not?

djo August 15, 2006

That's a good point; I just jumped in with something I thought was useful when I saw map(), the comparison implicit in __setitem__ passed me by. Using map() to generate side-effects is pretty alien to me.

In that case, I guess I'd do something based loosely on quicksort. It's nearly there already, it simply needs to throw away duplicate results when concatenating the child arrays (easy to do when the child arrays are guaranteed in-order and unique), and because it's a divide and conquer algorithm it should lend itself to a parallel environment.

ferringb August 15, 2006

not much point in parallelizing sorting unless # of elements are high enough that the splitting is actually worth it; plus at least for python, there still is the gil to worry about...

Brantley Harris August 15, 2006

This comes up often enough for me that I think there should be a built-in function.

Andrew Dalke August 15, 2006

Your benchmark code likely doesn't dowhat you think it does for the current case. The 'X' instances always compare different so everything is unique.

Here's my example, with a bit of a cheat to make the "idfun=None' case faster. "Using "." for leading spaces because I can't figure out how to put Python code in this comment system

def f7(seq, idfun=None):
....return list(_f7(seq, idfun))
def _f7(seq, idfun=None):
....seen = set()
....if idfun is None:
........for x in seq:
............if x in seen:
................continue
............seen.add(x)
............yield x
....else:
........for x in seq:
............x = idfun(x)
............if x in seen:
................continue
............seen.add(x)
............yield x

Since your benchmark didn't test that case I figured I could ignore it. :) The timing numbers I get are

* f2 66.65
* f4 66.13
* f5 2.19
* f7 1.91
f1 1.06
f3 0.97
f6 0.99

and these are in line with my own benchmark. Function call overhead in Python is high. Most of the performance difference comes from calling idfun.

I also don't incur the overhead of doing the list.append lookup for each new element. The list making is all in C code. There is overhead for the generator but that's pretty fast. you may also end up prefering an iterator solution over making full lists.

Regarding the parallelization of 'map'. That assumes a pure functional environment. Consider

>>> class Counter(object):
... def __init__(self): self.count = 0
... def __call__(self, x):
... i = self.count; self.count += 1; return i
...
>>> map(Counter(), [9, 8, 3, 6, 1])
[0, 1, 2, 3, 4]
>>>

Manuzhai August 15, 2006

I'd say that list(set([...])) is short enough? Or you could just work from the set directly. And I think the built-in set got a lot faster going from 2.3 to 2.4, due to it being rewritten in C.

Fuzzyman August 16, 2006

Even in Python 2.4 the built-in set is based on the Python dictionary. This is why the speed results were so similar.

In Python 2.5 it has been optimised further with a custom internal data-structure, and *should* be faster.

Fuzzyman

Peter Bengtsson August 16, 2006

but not ordered unfortunately.

Dave Kirby August 15, 2006

firstly "set" is a built in type in 2.4, and it is bad form to name variables the same as existing types or modules - it can lead to confusing code and subtle bugs.

Secondly, the sets.Set class used in f6 is implemented in Python, both in 2.3 and 2.4, while the set builtin type is implemented in C so should be significantly faster.

Here is alternative order-preserving function. I have not timed it, but it should be pretty fast:

def f7(seq):
seen = set()
return [ x for x in seq if x not in seen and not seen.add(x)]

Michael Urman August 15, 2006

What's most interesting is how changing the data seriously changes the actual performance. So if you know what your data looks like, you might be able do better than f5. If you don't know, f5 is probably a safe bet.

For instance, by changing the data to list('abcab'), and running the test 1000 times instead of 10, f2 becomes second fastest, and almost twice as fast as f5:
* f2 0.14
* f4 0.22
* f5 0.28
f1 0.16
f3 0.12
f6 0.45

Jack Diederich August 15, 2006

Kirby had the right idea but his implementation is more complicated than it needs to be.

def f7(seq): return list(set(seq))

You can drop the wrapping list() if you just need a sequence and not a real list. In that case the function is just

f7 = set

Charlie Gunyon August 16, 2006

Alright, so my list of changes:

Don't use sets.Set, use set built-in for all functions.

Use Andrew Dalke's f5() (though returning a generator is not what was required, reworked to return a list using list.append)

Use my f7(), which is:

def f7(seq):
....return {}.fromkeys(seq).keys()

Use Dave Kirby's f7() as f8().

Modify the test case to be huge:

blahlist = range(100000)
testdata = map(lambda a: a % 3, blahlist) + blahlist

Stop using f2() and f4(), they lose every time.

Results (old test case on the right):

* f5 54.688 --- 0.328
* f8 43.797 --- 0.297
f1 32.047 --- 0.156
f3 24.14 --- 0.157
f6 13.531 --- 0.14
f7 14.251 --- 0.156

So it looks like using the set built-in beats both sets.Set() and dict.

The other thing is there's probably very few cases where you'd want to unique-ify a list AND care about the order of that list at the same time. Removing duplicate transmissions is in that category (probably ought to fix your protocol if that's the case, though), but if there's others I can't think of them.

Anonymous August 16, 2006

If you recode the block

if marker in seen: continue
seen[marker] = 1
result.append(item)

to

if marker not in seen:
____seen[marker] = 1
____result.append(item)

the performance of f5 doubles!

Anonymous August 17, 2006

fastest non preserving (?):

def f7(seq):
____S,L = set, list
____return S(L(seq))

Jiri Vit August 18, 2006

Speed tip:
make local references in functions for builtins

def foo():
____L = []
____AppendToListL = L.append
...
...
____AppendToListL(item)

thesamet August 20, 2006

The results just confirm a known strategy to optimize python code. Replace python logic with C implementation that does the same. If it's built-in, the better. :)

Mike Watkins August 21, 2006

if the nicely simple list(set(seq)) appeals, then this variation may as well as its fast and preserves order:

def f13(seq):
# order preserving
....uniq = set(seq)
....return [item for item in seq if item in uniq]

Mike Watkins August 21, 2006

oops. never mind the above... should really have a coffee first (and write a test case too ...) Dave Kirby's version gets a +1 from me.

Mike Watkins August 21, 2006

The coffee-corrected return for f13 (above) should be:

return [item for item in seq if item in uniq and not uniq.remove(item)]

Essentially the inverse of Dave Kirby's approach but just a hair faster for whatever reason set() only knows.

nn August 22, 2006

My bet is on the memory allocation strategy. Adding to a set repeatedly allocates more memory for the set. I guess removing from the set does not deallocate memory until the end of the function.

Sven Berkvens-Matthijsse October 12, 2011

The (small) disadvantage over Mike Watkins' version when compared to Dave Kirby's version is that Mike's version walks the given sequence twice. That means that it cannot be a generator, while Dave's version will handle generators just fine.

Maxime Biais February 14, 2007

I always use this one (order preserving, generative, concise, but don't work on no hashable).

from itertools import ifilter
def _f14(iterable):
....m = set()
....return ifilter(lambda x: not (m.__contains__(x) or m.add(x)), iterable)

def f14(lst):
....return list(_f14(lst))

My results:
* f5 1.94
* f5b 1.89
* f8 1.18
* f10 1.24
* f11 1.23
* f13 1.31
* f14 1.87
f1 0.67
f3 0.64
f6 0.68
f7 0.38
f9 0.41

Not so bad ;)

MizardX January 22, 2009

I tried optimizing the top competitors:

* f8 5.239
* f8b 4.415 <--
* f8c 4.971
* f11 5.692
* f11b 4.882
* f11c 4.713 <--
* f12 5.76
* f12b 4.911 <--

f12 is Mike Watkins' function.

f8b, f11b, f12b is unchanged except for storing a reference to seen.add/uniq.remove

f8c also tried to store a reference to seen.__contains__, which back-fired.

In f11c I tried to change if-continue-statements to if-not-statements, as Anonymous sugested.

Optimizing attribute lookup can give 15% speedup.
Removing continue statements can give another 5%.

Changing the in-operator to a call to __contains__ removes 10%.

zoof June 17, 2009

I noticed that these functions don't work with lists of lists (i.e., lists are not hashable) -- is there an easy way to implement these functions for lists of lists or does that quickly become intractable (e.g., lists of lists of lists).

adrian July 8, 2009

thanks for this! saved me time testing speed on these various options.

Leigh Gordon July 24, 2009

I use this function on a script that needs to work on both version 2.3 and higher versions(and doesn't care about preserving order).

def unique(seq):
try:
return list(set(seq))
except NameError:
return {}.fromkeys(seq).keys()

Anonymous August 27, 2009

python newbie here..i need to unify an input, could someone help me with this?
i do not understand how to use any of these functions in a program.

testasdf1 September 17, 2009

>>>li=['a', 'mpilgrim', 'foo', 'b', 'c', 'b', 'd', 'd']
>>>[x for x in li if x not in locals()['_[1]']]

output:

['a', 'mpilgrim', 'foo', 'b', 'c', 'd']

While, I can't explain how locals()['_[1]'] works....

qman November 5, 2009

I could explain it, but then I'd have to kill you...

Rick September 28, 2009

cool

Anonymous November 14, 2009

Returns a unique list, sorted. (Speed is medium.)

[a for a,b in itertools.groupby(sorted(seq))]

It's f12 in the results below:
* f5 90.08
* f5b 87.62
* f8 15.004
* f10 56.096
* f11 59.153
f1 5.546
f3 21.098
f6 31.611
f7 3.47
f9 4.584
f12 25.881

Anonymous November 16, 2009

what about

dict(zip(seq,seq)).keys()

David December 11, 2009

why on earth would you do that when you could do list(set(seq)) ??

Anonymous January 26, 2010

Perhaps you would want the ordering?

Lynwood January 29, 2010

This is absolutely ingenious! Thank you for sharing!

Eddie April 14, 2010

Warning, this method is not guaranteed to be order-preserving. See http://docs.python.org/library/stdtypes.html#dict.items -- "CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionarys history of insertions and deletions." -- Note it says "arbitrary" and it is implementation-dependent.

Mark March 13, 2010

Thanks for sharing!

J Wolter April 20, 2010

In your updated code (http://www.peterbe.com/plog/uniqifiers-benchmark/uniqifiers_benchmark.py), I notice that f11 calls f10. Is this an error? Did you run it this way?

Peter Bengtsson April 20, 2010

Must be a typo. Haven't looked at this for a long time.

Greg Lindahl August 31, 2016

The typo is still there, 6 years later!

juanpi May 6, 2010

f1 apparently doesn't work on lists of tuples.
It would be nice to have a generalized version.

[(0, 1), (1, 2), (3, 2), (2, 1), (2, 0), (2, 3), (1, 0), (0, 2)]

lapo3399 June 2, 2010

This seems to work on any list, including one of lists or tuples:

[[elem,already.append(elem)][0] for elem in thisList if elem not in already]

Now, I'm not sure if it's bad practise to call a function in there sneakily like that, but it works ;).

Davidog February 22, 2011

I'm a newbie to Python, learning a whole new set of skills and programming methods. This page was a complete mini- education in itself.

My thanks to everyone who participated in the dialog, and especially our host, Peter Bengtsson.

"I am not worthy!" (Mike Myers and Dana Carvey)

Ming Huang April 5, 2011

I find a way to get the unique list

A = [1,1,2,2,3,3]
A = [i for i in set(A)]

Joe Jiang April 28, 2011

>>> aux = {}; [aux.setdefault(p, p) for p in list('ABeeE') if p not in aux];
['A', 'B', 'e', 'E']

Sergey Shepelev June 25, 2011

Thank you very much for this collection of functions. However, i
needed fast version with key (id) function that doesn't preserve
order, here's what i came up with:

def f12(seq, idfun):
return dict((idfun(x), x) for x in seq).values()

timing(f12, 100, testdata, len)

Results on my box:
* f8 2.43
* f10 2.5
* f11 2.6
f3 1.56
f7 1.1
f9 1.21
f12 1.8

Mike Hoy August 17, 2011

Thanks for this write up. It was very useful this morning.

Anonymous September 28, 2011

f1 0.0
o.O ? It finishes in 0.0 seconds? I think f1 is the fastest :)

Lars Hupfeldt November 3, 2011

Hi, I found a bug in the f10 version. When calling with the idfun argument it returns a list of the id objects instead of the original objects.
...
: else:
:: for x in seq:
::: x = idfun(x)
::: if x in seen:
:::: continue
::: seen.add(x)
::: yield x
-----
should be:
...
: else:
:: for x in seq:
::: xi = idfun(x)
::: if xi in seen:
:::: continue
::: seen.add(xi)
::: yield x

Dana Woodman March 9, 2012

FYI: You seem to have broken the download links for the script when migrating to your new blog.

hasan bardak April 19, 2012

Hello,
Very nice blog and post :D, the download link seems to be broken. Could you check that?
Cheers

Peter Bengtsson April 19, 2012

Fixed now. Thanks for noticing.

James Robert April 26, 2012

My personal favorite:
(preserves order)

def uniq(iterable, idfunc=lambda x:x):
. . . . seen = set()
. . . . return [seen.add(idfunc(x)) or x for x in iterable if idfunc(x) not in seen]

Anonymous May 8, 2012

* f5 4.846
* f5b 4.93
* f8 3.316
* f10 3.643
* f11 3.638
f1 2.064
f3 1.63
f6 1.935
f7 1.072
f9 1.199

Anonymous May 8, 2012

FYI: small spelling error ==> uniqifiers should be uniquifiers

eseprimo May 26, 2012

Did anyone address the issues with non hashable types? Note that sets and non hashable types don't go well together---which renders useless most of the functions above, except when dealing with trivial data. For example,

>>> alist=['a',[1,3,4],[3,3,3],[2,'b'],'a']
>>> set(alist)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Anonymous June 14, 2012

Not that hard. Some of the ordering is changed but that could be understandable since we are removing dups.

tp = lambda x: str(type(x)).split("'")[1].split("'")[0].replace("'",'')
alist=['a',[1,3,4],[3,3,3],[2,'b'],'a']
blist = [ i if tp(i) != 'list' else [ j for j in set(i) ] for i in alist ]
print("Alist: {0}\nBList: {1}".format(alist,blist))

Alist: ['a', [1, 3, 4], [3, 3, 3], [2, 'b'], 'a']
BList: ['a', [3, 1, 4], [3], ['b', 2], 'a']

Anonymous June 14, 2012

Not too hard. Bit ugly but working.

tp = lambda x: str(type(x)).split("'")[1].split("'")[0].replace("'",'')
alist=['a',[1,3,4],[3,3,3],[2,'b'],'a']
blist = [ i if tp(i) != 'list' else [ j for j in set(i) ] for i in alist ]
clist = list([ i for i in set( [ i if tp(i) != 'list' else '!None' for i in blist ] ) ])[:-1] +[ i for i in blist if tp(i)=='list' ]
print("Alist: {0}\nBList: {1}\nClist: {2}".format(alist,blist, clist))

grd August 3, 2012

order set:

def UniqueList(List):
    unique_set = sets.Set()
    unique_list = []
    for n in List:
        if n not in unique_set:
            unique_set.add(n)
            unique_list.append(n)

    return unique_list

nasgar November 14, 2012

why not?
def f5(self, seq, idfun=lambda x : x):

and remove:

if idfun is None:
def idfun(x): return x

Peter Bengtsson November 15, 2012

Because that makes it possible to optionally override what the key should be. Some items aren't hashable. E.g. dicts.

Harsh December 4, 2012

Peter - nasgar's merely replacing your identity function plug with an inline lambda in the function def. I think that is valid to do (achieves the same thing), but does look harder to read.

kmbt March 5, 2013

Hi. You have an error in your benchmark in the function
def f11(seq): # f10 but simpler
# Order preserving
return list(_f10(seq))

It should say _f11 instead of _f10 in the last line.
Cheers!

Ely Spears September 11, 2013

I'm curious if the recursive solution here: (http://stackoverflow.com/a/18729193/567620) is comparable for average-case kinds of lists (where number of unique elements is relatively small compared to total number of elements)?

This was inspired by Haskell's `nub` function. It would be cool to add it to the collection of uniquifiers in the main post and explore some timing about it. Below is the code for good measure:

def unique(lst):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: x != lst[0], lst[1:]))

I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:

import operator
def unique(lst, cmp_op=operator.ne):
    return [] if lst==[] else [lst[0]] + unique(filter(lambda x: cmp_op(x, lst[0]), lst[1:]), cmp_op)

For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:

def test_round(x,y):
    return round(x) != round(y)

then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer meant traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead meant to take only the first element that rounds to K for each possible integer K that the elements might round to, e.g.:

In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]

Aaron Swan January 28, 2014

I needed the index of the unique elements in an order preserved manner, so I made a small modification to f8:

seen = set()
[i for (i,x) in enumerate(seq) if x not in seen and not seen.add(x)]

Hope this is helpful to someone.

Great post. Thank you.

toddrjen March 23, 2014

Would it be possible to create a github or other online repo for this?

Also, what is the license of the code?

Murali April 20, 2014

hi,

Thanks a ton for the snippet. f5 works like a charm...and it's fast too!

M & M

Jane October 21, 2014

What about lists of dictionaries like this:
all = [ {'id': '001', 'age': '12'}, {'id': '533', 'info': 'student'}, {'id': '001', 'age': '12'} ]

qemi October 23, 2014

Asking toddrjens question again. What license is the code listed on this page?

Peter Bengtsson August 21, 2015

Free for all.

Satish November 25, 2014

Hi,

Thank you. f5 helped a lot!

Cornelis August 21, 2015

Sadly f5 does not work for a list of lists

Joshua September 15, 2015

I am new to python and have run into a serious problem that is causing me to grey prematurely. The problem that I am currently running into is that I am trying to pull unique IP addresses from a list of IPs that often have duplicates. What I have read above seems to be a list populated by the user. How would you do this if the list is prepopulated text file? Right now this is what my script looks like for pulling unique IPs out:

def uniq_count(item):
      seen = set()
      for item in item:
           seen.add(ips)
           yield len(seen)

in small batches it works but when the list is larger it seems to spit out two of each. Any help would be appreciated.

Peter Bengtsson September 23, 2015

`for item in item`??
Is that a typo.

If you want to count unique occurances of an iterator you can't return a generator. Something like this should work:

def uniq_count(iterator):
    seen = set()
    for item in iterator:
        seen.add(item)
    return len(seen)

And you'd be able to use it something like this:

print uniq_count(open('some.log'))

However, there's an even simpler way. Suppose you have a list of things. Like ['a', 'b', 'a', 'c']
Then, to get the count of unique elements you simply convert it to a set like this:

print len(set(['a', 'b', 'a', 'c'])) # will print 3

Shane October 6, 2015

Hi All,

I added a new variant of f5 / f5b (see below) that simply swaps the seen items being a dict to use a set. The thinking amongst myself, and colleagues, is a set() would be faster ... running the tests (simply adding below into the downloadable benchmark py) showed the dict (both f5 and f5b) is faster by 10-20% across python versions 2.6 - 3.1 ... does anyone else find this counter intuitive? and have a reason why this is so? (oh, for py3 testing I did remove f7 and used print function too)

# BEGIN f5c
def f5c(seq, idfun=None): # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = set()
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker not in seen:
            seen.add(marker)
            result.append(item)

    return result
# END f5c

TIA,

Lukáš January 13, 2016

Hello guys, the ordered variant can be even faster, if you change the condition:

```
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]
```

and another quite important optimization is to avoid __getattr__ call by storing the actual used function:

```
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
```

Dody January 17, 2016

in the example of the list above it mentioned that function f5 is the best solution. I thought that the fastest is f3 i do not understand why it shows f5. is it a typo? please explain

James Koss February 13, 2016

Wow, just wow. Using a dict{} to check duplicates is brilliant. So fast, so obviously faster than a repeat check in list. Well done!

www.JamesKoss.com

Anonymous March 29, 2016

added https://gist.github.com/katakumpo/6931e18dcd0cc4c20f8f8f324cea6056 for python 3 compatibility and also with orderedset package if installed

Andrew April 5, 2016

Hi all, I discovered this page while trying to write a simple one-liner to de-dupe lists with order preservation. I came up with the following using a list comprehension and slicing. The idea is to slice the list one index at a time and only add an element to the de-duped list if it does not yet exist in the slice. I'm not sure about performance but as far as readability goes I like the simplicity of this one.

series = [v for i, v in enumerate(series) if series[0:i].count(v) == 0]

none :) January 17, 2017

Joking m8

Ben Truscott February 12, 2017

I'd just like to point out that the premise of the test is somewhat flawed, because the various methods have different asymptotic complexity. f5 is fast for short inputs with many duplicates, but it scales poorly, so cannot be said to be universally the best approach. I came across this page while thinking about how to identify (rather than eliminate) duplicates in numeric arrays using NumPy, but after I'd arrived at something several times faster for large inputs. Still, the presentation and comparison between the algorithms is interesting.

Eric Werner July 25, 2018

Thats probably also why Peter wrote "best". And have you seen the script code he linked? There is lots to play around with length of items and loop counts!
It would be nice to see this tested against a reference at least once per testdata batch. Then with different lengths as well. Far too much for a small blog post like this.

I just stumbled across this again and tried out the "new" entries.
I was also amaaazed how incredibly FAST f1 became in Python 3.
then i looked a little closer:

testdata: ['f', 'g', 'c', 'd', 'b', 'a', 'a']
result: dict_keys([]) 😐

Previous:: Unicode strings to ASCII ...nicely August 8, 2006 Python
Next:: Madonna's birthday party at the Lounge Lover August 17, 2006 Photos

Related by category:: A Python dict that can report which keys you did not use June 12, 2025 Python; Combining Django signals with in-memory LRU cache August 9, 2025 Python; Native connection pooling in Django 5 with PostgreSQL June 25, 2025 Python; How I run standalone Python in 2025 January 14, 2025 Python

Related by keyword:: Combining Django signals with in-memory LRU cache August 9, 2025 Python, Django; Fastest way to uniquify a list in Python >=3.6 December 23, 2017 Python; Benchmark compare Highlight.js vs. Prism May 19, 2020 Node, JavaScript; mincss "Clears the junk out of your CSS" January 21, 2013 Python, Web development

Go to top of the page

Fastest way to uniqify a list in Python

Comments

Related posts