Your benchmark code likely doesn't do what you think it does for the current case. The 'X' instances always compare as unequal, so every element looks unique.
Here's my example, with a bit of a cheat to make the idfun=None case faster:

def f7(seq, idfun=None):
    return list(_f7(seq, idfun))

def _f7(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)
            if x in seen:
                continue
            seen.add(x)
            yield x
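As a quick sanity check, the function can be exercised like this (the definitions are repeated so the snippet runs on its own; note that when idfun is given, this version yields the mapped values, which is fine since the benchmark only exercises the idfun=None path):

```python
def f7(seq, idfun=None):
    return list(_f7(seq, idfun))

def _f7(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)  # note: the mapped value, not the original, is yielded
            if x in seen:
                continue
            seen.add(x)
            yield x

print(f7([1, 2, 1, 3, 2, 1]))          # [1, 2, 3]
print(f7(["a", "A", "b"], str.lower))  # ['a', 'b'] -- mapped values come back
```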
Since your benchmark didn't test that case I figured I could ignore it. :) The timing numbers I get are
* f2 66.65
* f4 66.13
* f5 2.19
* f7 1.91
* f1 1.06
* f3 0.97
* f6 0.99
and these are in line with my own benchmark. Function call overhead in Python is high; most of the performance difference comes from calling idfun.
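To put a rough number on the call-overhead claim, here's a timeit sketch (absolute times are machine-dependent; only the gap matters):

```python
import timeit

# Adding two numbers inline vs. through a function call. CPython constant-folds
# "3 + 4", so the inline case is essentially a constant load; the called case
# pays Python's per-call overhead a million times.
inline = timeit.timeit("3 + 4", number=1_000_000)
called = timeit.timeit("f(3, 4)", setup="def f(a, b): return a + b",
                       number=1_000_000)
print(f"inline: {inline:.3f}s  called: {called:.3f}s")
```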
I also don't incur the overhead of the list.append lookup for each new element; the list construction is all in C code. There is overhead for the generator, but that's pretty fast. You may also end up preferring an iterator solution over building full lists.
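As a sketch of that iterator angle (idfun=None path only): a consumer can stop early, and the generator never computes more than it hands out.

```python
import itertools

def _f7(seq):
    # the "idfun is None" branch of the generator above
    seen = set()
    for x in seq:
        if x in seen:
            continue
        seen.add(x)
        yield x

# Pull just the first three unique items out of an endless, repeating feed;
# a list-building version could never finish here.
feed = itertools.cycle([1, 2, 1, 3, 2, 4])
first_three = list(itertools.islice(_f7(feed), 3))
print(first_three)  # [1, 2, 3]
```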
Regarding the parallelization of 'map': that assumes a pure functional environment. Consider
>>> class Counter(object):
...     def __init__(self): self.count = 0
...     def __call__(self, x):
...         i = self.count; self.count += 1; return i
...
>>> map(Counter(), [9, 8, 3, 6, 1])
[0, 1, 2, 3, 4]
>>>
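To spell out why that breaks a parallel map: which element gets which count depends entirely on call order, which a chunked parallel map no longer guarantees. A toy simulation of two workers (no real threads, just reordered chunks; modern-Python syntax, and the chunk split is invented for illustration):

```python
class Counter:
    def __init__(self):
        self.count = 0
    def __call__(self, x):
        i = self.count
        self.count += 1
        return i

# Sequential map: counts follow input order.
assert list(map(Counter(), [9, 8, 3, 6, 1])) == [0, 1, 2, 3, 4]

# Simulated parallel map: split the input into chunks and suppose the
# scheduler happens to run the second chunk first. Same elements, same
# Counter, different answer.
c = Counter()
left, right = [9, 8], [3, 6, 1]
out_right = [c(x) for x in right]  # "worker B" finishes first
out_left = [c(x) for x in left]
print(out_left + out_right)  # [3, 4, 0, 1, 2]
```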