Don't forget your sets in Python!

10 March 2017   1 comment   Python

I had this piece of code:

new_all_ids = set(
    x for x in all_ids if x not in to_process_ids
)

The all_ids is a list of 1.1 million IDs, some of them repeated. And to_process_ids is a list of 1,000 unique IDs randomly sampled from that list. The objective of the code was to first extract 1,000 IDs, do something with them, and once that's done, update all_ids so that those 1,000 in to_process_ids are removed.

Only problem was that I noticed that this operation took 30+ seconds! How can it take 30 seconds to do a little bit of list comprehension? The explanation is the lack of fast lookups: the in operator on a list has no index to consult, so it has to scan the list element by element.

What the code actually does is this:

new_all_ids = set()
for x in all_ids:
    found = False
    for y in to_process_ids:
        if y == x:
            found = True
            break
    if not found:
        new_all_ids.add(x)

Can you see it? It's doing a nested loop. Or, O(n*m) in computer lingo, where n is the 1.1 million and m is the 1,000. In the worst case that's 1.1 million * 1,000 evaluations of that conditional.

What sets do, on the other hand, is store their keys in a hash table. Kinda similar to how a dict works, except it doesn't point to a value. That means asking a set if it contains a certain key is an O(1) operation.
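
You can see the difference in isolation with a quick micro-benchmark. This is just a minimal sketch (exact numbers will vary by machine) that puts the needle at the very end of the list, the worst case for a linear scan:

import timeit

haystack_list = list(range(1000000))
haystack_set = set(haystack_list)

# Worst case for the list: the needle is the last element, so the
# linear scan has to walk all 1,000,000 elements before answering.
needle = 999999

print('list:', timeit.timeit(lambda: needle in haystack_list, number=100))
print('set: ', timeit.timeit(lambda: needle in haystack_set, number=100))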

You might have spotted the solution already.
If you instead do:

to_process_ids_set = set(to_process_ids)
new_all_ids = set(
    x for x in all_ids if x not in to_process_ids_set
)

Now, in my particular example, instead of taking 30+ seconds this implementation only takes 0.026 seconds. That's over 1,000 times faster.

You might also have noticed that there's a much more convenient way to do this very same thing, namely set operations! The desired new_all_ids is going to become a set anyway. And if we convert all_ids to a set first, the duplicates collapse away before the difference is computed, so we never loop over the repeats at all.

Final solution:

new_all_ids = set(all_ids) - set(to_process_ids)
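
Putting it all together, here's a minimal sketch of the whole workflow from the top of the post. The process() function is a hypothetical placeholder for whatever work is actually done on each batch:

import random

def process(ids):
    # Hypothetical stand-in for the real per-batch work.
    pass

# A non-unique pool of IDs, like the 1.1 million one in the post.
all_ids = [random.randint(1, 400000) for _ in range(200000)]

# Extract 1,000 unique IDs to work on. (list() because random.sample
# in newer Pythons no longer accepts sets.)
to_process_ids = random.sample(list(set(all_ids)), 1000)
process(to_process_ids)

# Remove the processed IDs from the pool in one set difference.
all_ids = set(all_ids) - set(to_process_ids)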

You can try it yourself with this little benchmark:

"""
Demonstrate three ways how to reduce a non-unique list of integers
by 1,000 randomly selected unique ones.
Demonstrates the huge difference between lists and set lookups.
"""
import time
import random

# Range numbers chosen quite arbitrarily to get a benchmark that lasts
# a couple of seconds with a realistic proportion of unique and 
# repeated integers.
items = 200000
all_ids = [random.randint(1, items * 2) for _ in range(items)]

print('About', 100. * len(set(all_ids)) / len(all_ids), '% unique')

# list() because random.sample in newer Pythons no longer accepts sets
to_process_ids = random.sample(list(set(all_ids)), 1000)


def f1(to_process_ids):
    return set(x for x in all_ids if x not in to_process_ids)


def f2(to_process_ids):
    to_process_ids_set = set(to_process_ids)
    return set(x for x in all_ids if x not in to_process_ids_set)


def f3(to_process_ids):
    return set(all_ids) - set(to_process_ids)


functions = [f1, f2, f3]
results = {f.__name__: [] for f in functions}
for i in range(10):
    random.shuffle(functions)
    for f in functions:
        t0 = time.time()
        f(to_process_ids)
        t1 = time.time()
        results[f.__name__].append(t1 - t0)

for function in sorted(results):
    print(function, sum(results[function]) / len(results[function]))

When I run that with my Python 3.5.1 I get:

About 78.616 % unique
f1 3.9494598865509034
f2 0.041156983375549315
f3 0.02245485782623291

Seems to match expectations.

Comments

Chris Adams
I like the solution you ultimately ended up with but tend to write it in this form:

new_all_ids = set(all_ids).difference(to_process_ids)

In the example as written that's not too much of a difference but I find I like the reminder that it's a set which the .difference() call provides, and it allows me to be lazy about converting the second argument to a set first since difference will also accept a list, tuple, generator, Django values_list(), etc.
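
For example (a tiny sketch with made-up values):

all_ids = [1, 2, 2, 3, 4, 5]
# The argument doesn't have to be a set; any iterable works:
new_all_ids = set(all_ids).difference(x for x in [2, 4])
print(new_all_ids)  # {1, 3, 5}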
