Best EXPLAIN ANALYZE benchmark script

19 April 2018   0 comments   Python, PostgreSQL

tl;dr; Use to benchmark a SQL query in Postgres.

I often benchmark SQL by extracting the relevant SQL string, prefix it with EXPLAIN ANALYZE, putting it into a file (e.g. benchmark.sql) and then running psql mydatabase < benchmark.sql. That spits out something like this:

                                                           QUERY PLAN
 Index Scan using main_song_ca949605 on main_song  (cost=0.43..237.62 rows=1 width=4) (actual time=1.586..1.586 rows=0 loops=1)
   Index Cond: (artist_id = 27451)
   Filter: (((name)::text % 'Facing The Abyss'::text) AND (id <> 2856345))
   Rows Removed by Filter: 170
 Planning time: 3.335 ms
 Execution time: 1.701 ms
(6 rows)

Cool. So you study the steps of the query plan and look for "Seq Scan" and various sub-optimal uses of heaps and indices etc. But often, you really want to just look at the Execution time milliseconds number. Especially if you might have to slightly different SQL queries to compare and contrast.

However, as you might have noticed, the number on the Execution time varies between runs. You might think nothing's changed but Postgres might have warmed up some internal caches or your host might be more busy or less busy. To remedy this, you run the EXPLAIN ANALYZE select ... a couple of times to get a feeling for an average. But there's a much better way!

Check this out:

Download it into your ~/bin/ and chmod +x ~/bin/ I wrote it just this morning so don't judge!

Now, when you run it it runs that thing 10 times (by default) and reports the best Execution time, its mean and its median. Example output:

▶ songsearch dummy.sql
    BEST    1.229ms
    MEAN    1.489ms
    MEDIAN  1.409ms
    BEST    1.994ms
    MEAN    4.557ms
    MEDIAN  2.292ms

The "BEST" is an important metric. More important than mean or median.

Raymond Hettinger explains it better than I do. His context is for benchmarking Python code but it's equally applicable:

"Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."


Your email will never ever be published

Related posts

Efficient many-to-many field lookup in Django REST Framework 11 April 2018
The impressive first-meaningful-paint improvement of using minimalcss 24 April 2018
Related by Keyword:
Show size of every PostgreSQL database you have 07 February 2018
Conditional aggregation in Django 2.0 12 January 2018
When Docker is too slow, use your host 11 January 2018
Fastest *local* cache backend possible for Django 04 August 2017
How to do performance micro benchmarks in Python 24 June 2017
Related by Text:
jQuery and Highslide JS 08 January 2008
I'm back! has been renewed 05 June 2005
Anti-McCain propaganda videos 12 August 2008
I'm Prolog 01 May 2007
Ever wondered how much $87 Billion is? 04 November 2003