How to sort case insensitively with empty strings last in Django
April 3, 2022
1 comment Django, Python, PostgreSQL
Imagine you have something like this in Django:
class MyModel(models.Models):
last_name = models.CharField(max_length=255, blank=True)
...
The most basic sorting is either: queryset.order_by('last_name')
or queryset.order_by('-last_name')
. But what if you want entries with a blank string last? And, you want it to be case insensitive. Here's how you do it:
from django.db.models.functions import Lower, NullIf
from django.db.models import Value
if reverse:
order_by = Lower("last_name").desc()
else:
order_by = Lower(NullIf("last_name", Value("")), nulls_last=True)
ALL = list(queryset.values_list("last_name", flat=True))
print("FIRST 5:", ALL[:5])
# Will print either...
# FIRST 5: ['Zuniga', 'Zukauskas', 'Zuccala', 'Zoller', 'ZM']
# or
# FIRST 5: ['A', 'aaa', 'Abrams', 'Abro', 'Absher']
print("LAST 5:", ALL[-5:])
# Will print...
# LAST 5: ['', '', '', '', '']
This is only tested with PostgreSQL but it works nicely.
If you're curious about what the SQL becomes, it's:
SELECT "main_contact"."last_name" FROM "main_contact"
ORDER BY LOWER(NULLIF("main_contact"."last_name", '')) ASC
or
SELECT "main_contact"."last_name" FROM "main_contact"
ORDER BY LOWER("main_contact"."last_name") DESC
Note that if your table columns is either a string, an empty string, or null, the reverse needs to be: Lower("last_name", nulls_last=True).desc()
.
Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON
September 30, 2019
2 comments Redis, Nginx, Web Performance, Python, Django, PostgreSQL
Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.
However, it's a fact that I have do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.
But is it entirely wrong to look at WHICH database to get the best speed?
Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx
Nginx??
The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:
const response = await fetch('https://songsear.ch/api/song/5246889/amazon');
Internally, in the app, what it does is that it looks this up, by ID, on the AmazonAffiliateLookup
ORM model. Suppose it wasn't there in the PostgreSQL, it uses the Amazon Affiliate Product Details API, to look it up and when the results come in it stores a copy of this in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code is shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller. Partly because it limits the results to the first/best match and it deletes a bunch of things that are never needed such as ProductTypeName
, Studio
, TrackSequence
etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.
Again, Nginx?
Nginx has a built in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle of how much RAM this it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable to let Nginx handle cache in RAM.
Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:
▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes Summary: Total: 0.9882 secs Slowest: 0.0279 secs Fastest: 0.0001 secs Average: 0.0010 secs Requests/sec: 10119.8265 Response time histogram: 0.000 [1] | 0.003 [9752] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 0.006 [108] | 0.008 [70] | 0.011 [32] | 0.014 [8] | 0.017 [12] | 0.020 [11] | 0.022 [1] | 0.025 [4] | 0.028 [1] | Latency distribution: 10% in 0.0003 secs 25% in 0.0006 secs 50% in 0.0008 secs 75% in 0.0010 secs 90% in 0.0013 secs 95% in 0.0016 secs 99% in 0.0068 secs Details (average, fastest, slowest): DNS+dialup: 0.0000 secs, 0.0001 secs, 0.0279 secs DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0026 secs req write: 0.0000 secs, 0.0000 secs, 0.0011 secs resp wait: 0.0008 secs, 0.0001 secs, 0.0206 secs resp read: 0.0001 secs, 0.0000 secs, 0.0013 secs Status code distribution: [200] 10000 responses
10,000 requests across 10 clients at rougly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a Cache-Control
HTTP header with something like max-age=86400
.
Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control
, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).
Conclusion
If you the total amount of data that can and will be cached is controlled, putting it in a HTTP reverse proxy cache is probably order of magnitude faster than messing with chosing which database to use.
How much faster is Redis at storing a blob of JSON compared to PostgreSQL?
September 28, 2019
65 comments Python, PostgreSQL, Redis
tl;dr; Redis is 16 times faster at reading these JSON blobs.
In Song Search when you've found a song, it loads some affiliate links to Amazon.com. (In case you're curious it's earning me lower double-digit dollars per month). To avoid overloading the Amazon Affiliate Product API, after I've queried their API, I store that result in my own database along with some metadata. Then, the next time someone views that song page, it can read from my local database. With me so far?
The other caveat is that you can't store these lookups locally too long since prices change and/or results change. So if my own stored result is older than a couple of hundred days, I delete it and fetch from the network again. My current implementation uses PostgreSQL (via the Django ORM) to store this stuff. The model looks like this:
class AmazonAffiliateLookup(models.Model, TotalCountMixin):
song = models.ForeignKey(Song, on_delete=models.CASCADE)
matches = JSONField(null=True)
search_index = models.CharField(max_length=100, null=True)
lookup_seconds = models.FloatField(null=True)
created = models.DateTimeField(auto_now_add=True, db_index=True)
modified = models.DateTimeField(auto_now=True)
At the moment this database table is 3GB on disk.
Then, I thought, why not use Redis for this. Then I can use Redis's "natural" expiration by simply setting as expiry time when I store it and then I don't have to worry about cleaning up old stuff at all.
The way I'm using Redis in this project is as a/the cache backend and I have it configured like this:
CACHES = {
"default": {
"BACKEND": "django_redis.cache.RedisCache",
"LOCATION": REDIS_URL,
"TIMEOUT": config("CACHE_TIMEOUT", 500),
"KEY_PREFIX": config("CACHE_KEY_PREFIX", ""),
"OPTIONS": {
"COMPRESSOR": "django_redis.compressors.zlib.ZlibCompressor",
"SERIALIZER": "django_redis.serializers.msgpack.MSGPackSerializer",
},
}
}
The speed difference
Perhaps unrealistic but I'm doing all this testing here on my MacBook Pro. The connection to Postgres (version 11.4) and Redis (3.2.1) are both on localhost
.
Reads
The reads are the most important because hopefully, they happen 10x more than writes as several people can benefit from previous saves.
I changed my code so that it would do a read from both databases and if it was found in both, write down their time in a log file which I'll later summarize. Results are as follows:
PG: median: 8.66ms mean : 11.18ms stdev : 19.48ms Redis: median: 0.53ms mean : 0.84ms stdev : 2.26ms (310 measurements)
It means, when focussing on the median, Redis is 16 times faster than PostgreSQL at reading these JSON blobs.
Writes
The writes are less important but due to the synchronous nature of my Django, the unlucky user who triggers a look up that I didn't have, will have to wait for the write before the XHR request can be completed. However, when this happens, the remote network call to the Amazon Product API is bound to be much slower. Results are as follows:
PG: median: 8.59ms mean : 8.58ms stdev : 6.78ms Redis: median: 0.44ms mean : 0.49ms stdev : 0.27ms (137 measurements)
It means, when focussing on the median, Redis is 20 times faster than PostgreSQL at writing these JSON blobs.
Conclusion and discussion
First of all, I'm still a PostgreSQL fan-boy and have no intention of ceasing that. These times are made up of much more than just the individual databases. For example, the PostgreSQL speeds depend on the Django ORM code that makes the SQL and sends the query and then turns it into the model instance. I don't know what the proportions are between that and the actual bytes-from-PG's-disk times. But I'm not sure I care either. The tooling around the database is inevitable mostly and it's what matters to users.
Both Redis and PostgreSQL are persistent and survive server restarts and crashes etc. And you get so many more "batch related" features with PostgreSQL if you need them, such as being able to get a list of the last 10 rows added for some post-processing batch job.
I'm currently using Django's cache framework, with Redis as its backend, and it's a cache framework. It's not meant to be a persistent database. I like the idea that if I really have to I can just flush the cache and although detrimental to performance (temporarily) it shouldn't be a disaster. So I think what I'll do is store these JSON blobs in both databases. Yes, it means roughly 6GB of SSD storage but it also potentially means loading a LOT more into RAM on my limited server. That extra RAM usage pretty much sums of this whole blog post; of course it's faster if you can rely on RAM instead of disk. Now I just need to figure out how RAM I can afford myself for this piece and whether it's worth it.
UPDATE September 29, 2019
I experimented with an optimization of NOT turning the Django ORM query into a model instance for each record. Instead, I did this:
+from dataclasses import dataclass
+@dataclass
+class _Lookup:
+ modified: datetime.datetime
+ matches: list
...
+base_qs = base_qs.values_list("modified", "matches")
-lookup = base_qs.get(song__id=song_id)
+lookup_tuple = base_qs.get(song__id=song_id)
+lookup = _Lookup(*lookup_tuple)
print(lookup.modified)
Basically, let the SQL driver's "raw Python" content come through the Django ORM. The old difference between PostgreSQL and Redis was 16x. The new difference was 14x instead.
Best way to count distinct indexed things in PostgreSQL
March 21, 2019
3 comments Django, PostgreSQL
tl;dr; SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;
I have a table that looks like this:
songsearch=# \d main_songtexthash Table "public.main_songtexthash" Column | Type | Collation | Nullable | -----------+--------------------------+-----------+----------+ id | integer | | not null | text_hash | character varying(32) | | not null | created | timestamp with time zone | | not null | modified | timestamp with time zone | | not null | song_id | integer | | not null | Indexes: "main_songtexthash_pkey" PRIMARY KEY, btree (id) "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id) "main_songtexthash_text_hash_c2771f1f" btree (text_hash) "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops) Foreign-key constraints: ...snip...
And the data looks something like this:
songsearch=# select text_hash, song_id from main_songtexthash limit 10; text_hash | song_id ----------------------------------+--------- 6f98e1945e64353bead9d6ab47a7f176 | 2565031 0c6662363aa4a340fea5efa24c98db76 | 486091 a25af539b183cbc338409c7acecc6828 | 212 5aaf561b38c251e7d863aae61fe1363f | 2141077 6a221df60f7cbb8a4e604f87c9e3aec0 | 245186 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382 95c395dd78679120269518b19187ca80 | 981402 8ab19b32b3be2d592aa69e4417b732cd | 616848 8ab19b32b3be2d592aa69e4417b732cd | 243393 01568f1f57aeb7a97e2544978fc93b4c | 333 (10 rows)
If you look carefully, you'll notice that every song_id
has a different text_hash
except two of them.
Song IDs 616848
and 243393
both have the same text_hash
of value 8ab19b32b3be2d592aa69e4417b732cd
.
Also, if you imagine this table only has 10 rows, you could quickly and easily conclude that there are 10 different song_id
but 9 different distinct text_hash
. However, how do you do this counting if the tables are large??
The Wrong Way
songsearch=# select count(distinct text_hash) from main_songtexthash; count --------- 1825983 (1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1) -> Seq Scan on main_songtexthash (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1) Planning Time: 0.059 ms Execution Time: 40029.250 ms (4 rows)
Oh noes! A Sec Scan
! Run!
The Right Way
Better explained in this blog post but basically, cutting to the chase, here's how you count on an indexed field:
songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t; count --------- 1825983 (1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1) -> Unique (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1) -> Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1) Heap Fetches: 1879521 Planning Time: 0.082 ms Execution Time: 4894.581 ms (6 rows)
Same exact result but ~5s instead of ~40s. I'll take that, thank you very much.
The Django Way
As a bonus: Django is smart. Here's how they do it:
>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983
And, the SQL it generates to make that count looks very familiar:
SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery
Conclusion
- Avoid "sequential scans" like the plague if you care about performance (...or not just killing your resources).
- Trust in Django.
Django ORM optimization story on selecting the least possible
February 22, 2019
16 comments Web development, Django, Python, PostgreSQL
This an optimization story that should not surprise anyone using the Django ORM. But I thought I'd share because I have numbers now! The origin of this came from a real requirement. For a given parent model, I'd like to extract the value of the name
column of all its child models, and the turn all these name
strings into 1 MD5 checksum string.
Variants
The first attempted looked like this:
artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
The SQL used to generate this is as follows:
SELECT "main_song"."id", "main_song"."artist_id", "main_song"."name",
"main_song"."text", "main_song"."language", "main_song"."key_phrases",
"main_song"."popularity", "main_song"."text_length", "main_song"."metadata",
"main_song"."created", "main_song"."modified",
"main_song"."has_lastfm_listeners", "main_song"."has_spotify_popularity"
FROM "main_song" WHERE "main_song"."artist_id" = 22729;
Clearly, I don't need anything but just the name
column, version 2:
artist = Artist.objects.get(name="Bad Religion")
names = []
for song in Song.objects.filter(artist=artist).only("name"):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
Now, the SQL used is:
SELECT "main_song"."id", "main_song"."name"
FROM "main_song" WHERE "main_song"."artist_id" = 22729;
But still, since I don't really need instances of model class Song
I can use the .values()
method which gives back a list of dictionaries. This is version 3:
names = []
for song in Song.objects.filter(artist=a).values("name"):
names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
This time Django figures it doesn't even need the primary key value so it looks like this:
SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729;
Last but not least; there is an even faster one. values_list()
. This time it doesn't even bother to map the column name to the value in a dictionary. And since I only need 1 column's value, I can set flat=True
. Version 4 looks like this:
names = []
for name in Song.objects.filter(artist=a).values_list("name", flat=True):
names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
Same SQL gets used this time as in version 3.
The benchmark
Hopefully this little benchmark script speaks for itself:
from songsearch.main.models import *
import hashlib
def f1(a):
names = []
for song in Song.objects.filter(artist=a):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f2(a):
names = []
for song in Song.objects.filter(artist=a).only("name"):
names.append(song.name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f3(a):
names = []
for song in Song.objects.filter(artist=a).values("name"):
names.append(song["name"])
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
def f4(a):
names = []
for name in Song.objects.filter(artist=a).values_list("name", flat=True):
names.append(name)
return hashlib.md5("".join(names).encode("utf-8")).hexdigest()
artist = Artist.objects.get(name="Bad Religion")
print(Song.objects.filter(artist=artist).count())
print(f1(artist) == f2(artist))
print(f2(artist) == f3(artist))
print(f3(artist) == f4(artist))
# Reporting
import time
import random
import statistics
functions = f1, f2, f3, f4
times = {f.__name__: [] for f in functions}
for i in range(500):
func = random.choice(functions)
t0 = time.time()
func(artist)
t1 = time.time()
times[func.__name__].append((t1 - t0) * 1000)
for name in sorted(times):
numbers = times[name]
print("FUNCTION:", name, "Used", len(numbers), "times")
print("\tBEST", min(numbers))
print("\tMEDIAN", statistics.median(numbers))
print("\tMEAN ", statistics.mean(numbers))
print("\tSTDEV ", statistics.stdev(numbers))
I ran this on my PostgreSQL 11.1 on my MacBook Pro with Django 2.1.7. So the database is on localhost
.
The results
276 True True True FUNCTION: f1 Used 135 times BEST 6.309986114501953 MEDIAN 7.531881332397461 MEAN 7.834429211086697 STDEV 2.03779968066591 FUNCTION: f2 Used 135 times BEST 3.039121627807617 MEDIAN 3.7298202514648438 MEAN 4.012803678159361 STDEV 1.8498943539073027 FUNCTION: f3 Used 110 times BEST 0.9920597076416016 MEDIAN 1.4405250549316406 MEAN 1.5053835782137783 STDEV 0.3523240470133114 FUNCTION: f4 Used 120 times BEST 0.9369850158691406 MEDIAN 1.3251304626464844 MEAN 1.4017681280771892 STDEV 0.3391019435930447
Discussion
I guess the hashlib.md5("".join(names).encode("utf-8")).hexdigest()
stuff is a bit "off-topic" but I checked and it's roughly 300 times faster than building up the names
list.
It's clearly better to ask less of Python and PostgreSQL to get a better total time. No surprise there. What was interesting was the proportion of these differences. Memorize that and you'll be better equipped if it's worth the hassle of not using the Django ORM in the most basic form.
Also, do take note that this is only relevant in when dealing with many records. The slowest variant (f1
) takes, on average, 7 milliseconds.
Summarizing the difference with percentages compared to the fastest variant:
f1
- 573% slowerf2
- 225% slowerf3
- 6% slowerf4
- 0% slower
UPDATE Feb 25 2019
James suggested, although a bit "missing the point", that it could be even faster if all the aggregation is pushed into the PostgreSQL server and then the only thing that needs to transfer from PostgreSQL to Python is the final result.
By the way, name
column in this particular benchmark, when concatenated into one big string, is ~4KB. So, with variant f5
it only needs to transfer 32 bytes which will/would make a bigger difference if the network latency is higher.
Here's the whole script: https://gist.github.com/peterbe/b2b7ed95d422ab25a65639cb8412e75e
And the results:
276 True True True False False FUNCTION: f1 Used 92 times BEST 5.928993225097656 MEDIAN 7.311463356018066 MEAN 7.594626882801885 STDEV 2.2027017044658423 FUNCTION: f2 Used 75 times BEST 2.878904342651367 MEDIAN 3.3979415893554688 MEAN 3.4774907430013022 STDEV 0.5120246550765524 FUNCTION: f3 Used 88 times BEST 0.9310245513916016 MEDIAN 1.1944770812988281 MEAN 1.3105544176968662 STDEV 0.35922655625999383 FUNCTION: f4 Used 71 times BEST 0.7879734039306641 MEDIAN 1.1661052703857422 MEAN 1.2262606284987758 STDEV 0.3561764250427344 FUNCTION: f5 Used 90 times BEST 0.7929801940917969 MEDIAN 1.0334253311157227 MEAN 1.1836051940917969 STDEV 0.4001442703048186 FUNCTION: f6 Used 84 times BEST 0.80108642578125 MEDIAN 1.1119842529296875 MEAN 1.2281338373819988 STDEV 0.37146893005516973
Result: f5
is takes 0.793ms and (the previous "winner") f4
takes 0.788ms.
I'm not entirely sure why f5
isn't faster but I suspect it's because the dataset is too small for it all to matter.
Compare:
songsearch=# explain analyze SELECT "main_song"."name" FROM "main_song" WHERE "main_song"."artist_id" = 22729; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Index Scan using main_song_ca949605 on main_song (cost=0.43..229.33 rows=56 width=16) (actual time=0.014..0.208 rows=276 loops=1) Index Cond: (artist_id = 22729) Planning Time: 0.113 ms Execution Time: 0.242 ms (4 rows)
with...
songsearch=# explain analyze SELECT md5(STRING_AGG("main_song"."name", '')) AS "names_hash" FROM "main_song" WHERE "main_song"."artist_id" = 22729; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------ Aggregate (cost=229.47..229.48 rows=1 width=32) (actual time=0.278..0.278 rows=1 loops=1) -> Index Scan using main_song_ca949605 on main_song (cost=0.43..229.33 rows=56 width=16) (actual time=0.019..0.204 rows=276 loops=1) Index Cond: (artist_id = 22729) Planning Time: 0.115 ms Execution Time: 0.315 ms (5 rows)
I ran these two SQL statements about 100 times each and recorded their best possible execution times:
1) The plain SELECT
- 0.99ms
2) The STRING_AGG
- 1.06ms
So that accounts from ~0.1ms difference only! Which kinda matches the results seen above. All in all, I think the dataset is too small to demonstrate this technique. But, considering the chance that the complexity might not be linear with the performance benefit, it's still interesting.
Even though this tangent is a big off-topic, it is often a great idea to push as much work into the database as you can if applicable. Especially if it means you can transfer a lot less data eventually.
How I performance test PostgreSQL locally on macOS
December 10, 2018
2 comments Web development, macOS, PostgreSQL
It's weird to do performance analysis of a database you run on your laptop. When testing some app, your local instance probably has 1/1000 the amount of realistic data compared to a production server. Or, you're running a bunch of end-to-end integration tests whose PostgreSQL performance doesn't make sense to measure.
Anyway, if you are doing some performance testing of an app that uses PostgreSQL one great tool to use is pghero. I use it for my side-projects and it gives me such nice insights into slow queries that I'm willing to live with the cost that it is to run it on a production database.
This is more of a brain dump of how I run it locally:
First, you need to edit your postgresql.conf
. Even if you used Homebrew to install it, it's not clear where the right config file is. Start psql
(on any database) and type this to find out which file is the one:
$ psql kintobench
kintobench=# show config_file;
config_file
-----------------------------------------
/usr/local/var/postgres/postgresql.conf
(1 row)
Now, open /usr/local/var/postgres/postgresql.conf
and add the following lines:
# Peterbe: From Pghero's configuration help. shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.track = all
Now, to restart the server use:
▶ brew services restart postgresql
Stopping `postgresql`... (might take a while)
==> Successfully stopped `postgresql` (label: homebrew.mxcl.postgresql)
==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)
The next thing you need is pghero
itself and it's easy to run in docker. So to start, you need Docker for mac installed. You also need to know the database URL. Here's how I ran it:
docker run -ti -e DATABASE_URL=postgres://peterbe:@host.docker.internal:5432/kintobench -p 8080:8080 ankane/pghero
Note the trick of peterbe:@host.docker.internal
because I don't use a password but inside the Docker container it doesn't know my terminal username. And the host.docker.internal
is so the Docker container can reach the PostgreSQL installed on the host.
Once that starts up you can go to http://localhost:8080
in a browser and see a listing of all the cumulatively slowest queries. There are other cool features in pghero
too that you can immediately benefit from such as hints about unused/redundent database indices.
Hope it helps!