tl;dr; SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;
I have a table that looks like this:
songsearch=# \d main_songtexthash Table "public.main_songtexthash" Column | Type | Collation | Nullable | -----------+--------------------------+-----------+----------+ id | integer | | not null | text_hash | character varying(32) | | not null | created | timestamp with time zone | | not null | modified | timestamp with time zone | | not null | song_id | integer | | not null | Indexes: "main_songtexthash_pkey" PRIMARY KEY, btree (id) "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id) "main_songtexthash_text_hash_c2771f1f" btree (text_hash) "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops) Foreign-key constraints: ...snip...
And the data looks something like this:
songsearch=# select text_hash, song_id from main_songtexthash limit 10; text_hash | song_id ----------------------------------+--------- 6f98e1945e64353bead9d6ab47a7f176 | 2565031 0c6662363aa4a340fea5efa24c98db76 | 486091 a25af539b183cbc338409c7acecc6828 | 212 5aaf561b38c251e7d863aae61fe1363f | 2141077 6a221df60f7cbb8a4e604f87c9e3aec0 | 245186 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382 95c395dd78679120269518b19187ca80 | 981402 8ab19b32b3be2d592aa69e4417b732cd | 616848 8ab19b32b3be2d592aa69e4417b732cd | 243393 01568f1f57aeb7a97e2544978fc93b4c | 333 (10 rows)
If you look carefully, you'll notice that every song_id
has a different text_hash
except two of them.
Song IDs 616848
and 243393
both have the same text_hash
of value 8ab19b32b3be2d592aa69e4417b732cd
.
Also, if you imagine this table only has 10 rows, you could quickly and easily conclude that there are 10 different song_id
but 9 different distinct text_hash
. However, how do you do this counting if the tables are large??
The Wrong Way
songsearch=# select count(distinct text_hash) from main_songtexthash; count --------- 1825983 (1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1) -> Seq Scan on main_songtexthash (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1) Planning Time: 0.059 ms Execution Time: 40029.250 ms (4 rows)
Oh noes! A Sec Scan
! Run!
The Right Way
Better explained in this blog post but basically, cutting to the chase, here's how you count on an indexed field:
songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t; count --------- 1825983 (1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1) -> Unique (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1) -> Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1) Heap Fetches: 1879521 Planning Time: 0.082 ms Execution Time: 4894.581 ms (6 rows)
Same exact result but ~5s instead of ~40s. I'll take that, thank you very much.
The Django Way
As a bonus: Django is smart. Here's how they do it:
>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983
And, the SQL it generates to make that count looks very familiar:
SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery
Conclusion
- Avoid "sequential scans" like the plague if you care about performance (...or not just killing your resources).
- Trust in Django.
Comments
Django is smart, whoa. But this should be fixed in Postgres so that not *every developer using sql* has to learn the work around...I mean I think I even had to learn this the hard way myself once. Come on Postgres!
Which PostgreSQL version did you use?
I think you'll find that
select count(*) from (select text_hash from main_songtexthash group by text_hash) t;
is even faster. Although it might depend on your dataset.