Best way to count distinct indexed things in PostgreSQL

21 March 2019   1 comment   Django, PostgreSQL

tl;dr; SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;

I have a table that looks like this:

songsearch=# \d main_songtexthash
            Table "public.main_songtexthash"
  Column   |           Type           | Collation | Nullable |
-----------+--------------------------+-----------+----------+
 id        | integer                  |           | not null |
 text_hash | character varying(32)    |           | not null |
 created   | timestamp with time zone |           | not null |
 modified  | timestamp with time zone |           | not null |
 song_id   | integer                  |           | not null |
Indexes:
    "main_songtexthash_pkey" PRIMARY KEY, btree (id)
    "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id)
    "main_songtexthash_text_hash_c2771f1f" btree (text_hash)
    "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops)
Foreign-key constraints:
    ...snip...

And the data looks something like this:

songsearch=# select text_hash, song_id from main_songtexthash limit 10;
            text_hash             | song_id
----------------------------------+---------
 6f98e1945e64353bead9d6ab47a7f176 | 2565031
 0c6662363aa4a340fea5efa24c98db76 |  486091
 a25af539b183cbc338409c7acecc6828 |     212
 5aaf561b38c251e7d863aae61fe1363f | 2141077
 6a221df60f7cbb8a4e604f87c9e3aec0 |  245186
 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382
 95c395dd78679120269518b19187ca80 |  981402
 8ab19b32b3be2d592aa69e4417b732cd |  616848
 8ab19b32b3be2d592aa69e4417b732cd |  243393
 01568f1f57aeb7a97e2544978fc93b4c |     333
(10 rows)

If you look carefully, you'll notice that every song_id has a different text_hash except two of them.
Song IDs 616848 and 243393 both have the same text_hash of value 8ab19b32b3be2d592aa69e4417b732cd.

Also, if you imagine this table only has 10 rows, you could quickly and easily conclude that there are 10 different song_id but 9 different distinct text_hash. However, how do you do this counting if the tables are large??

The Wrong Way

songsearch=# select count(distinct text_hash) from main_songtexthash;
  count
---------
 1825983
(1 row)

And the explanation and cost analysis is:

songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1)
   ->  Seq Scan on main_songtexthash  (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1)
 Planning Time: 0.059 ms
 Execution Time: 40029.250 ms
(4 rows)

Oh noes! A Sec Scan! Run!

The Right Way

Better explained in this blog post but basically, cutting to the chase, here's how you count on an indexed field:

songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t;
  count
---------
 1825983
(1 row)

And the explanation and cost analysis is:

songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t;
                                                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1)
   ->  Unique  (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1)
         ->  Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash  (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1)
               Heap Fetches: 1879521
 Planning Time: 0.082 ms
 Execution Time: 4894.581 ms
(6 rows)

Same exact result but ~5s instead of ~40s . I'll take that, thank you very much.

The Django Way

As a bonus: Django is smart. Here's how they do it:

>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983

And, the SQL it generates to make that count looks very familiar:

SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery

Conclusion

Comments

ROGER DELOY PACK

Django is smart, whoa. But this should be fixed in Postgres so that not *every developer using sql* has to learn the work around...I mean I think I even had to learn this the hard way myself once. Come on Postgres!

Your email will never ever be published

Related posts

Previous:
Format numbers with numberWithCommas() or Number.toLocaleString() 05 March 2019
Next:
Optimize inlined SVG on developer.mozilla.org 04 April 2019