Filtered by PostgreSQL

crontabber now supports locking, both high- and low-level

March 4, 2017
0 comments Python, Mozilla, PostgreSQL

tl;dr You can now have multiple servers running crontabber, all talking to a central PostgreSQL for their state, and not have to worry about jobs being started more than once. This is super useful if your crontabber apps kick off stored procedures that would freak out if run more than once with the same parameters.

crontabber is an advanced Python program for running cron-like applications in a predictable way. Every app is a Python class with a run method. Example here. Until version 0.18 you had to do the locking outside of crontabber, but now the locking has been "internalized". Meaning, if you open two terminals and run python crontabber.py --admin.conf=myconfig.ini in both, you don't have to worry about the same apps being started in parallel.
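
To give a rough idea of what such an app looks like, here's a minimal sketch. I'm writing the import path and class attributes from memory of the crontabber docs, so treat them as assumptions and check the project's documentation for the real example.


# Minimal sketch of a crontabber app. The crontabber.base.BaseCronApp
# import is an assumption based on the project's docs; verify it there.
from crontabber.base import BaseCronApp


class MyFirstJob(BaseCronApp):
    app_name = 'my-first-job'  # the name that ends up in the crontabber table
    app_description = 'Appends a line to a file every time it runs'

    def run(self):
        # Whatever the job is supposed to do goes in here.
        with open('/tmp/my-first-job.log', 'a') as f:
            f.write('It works!\n')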

General, business logic locking

Every app has a state. It's stored in PostgreSQL. It looks like this:

# \d crontabber
              Table "public.crontabber"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 app_name     | text                     | not null
 next_run     | timestamp with time zone |
 first_run    | timestamp with time zone |
 last_run     | timestamp with time zone |
 last_success | timestamp with time zone |
 error_count  | integer                  | default 0
 depends_on   | text[]                   |
 last_error   | json                     |
 ongoing      | timestamp with time zone |
Indexes:
    "crontabber_unique_app_name_idx" UNIQUE, btree (app_name)

The last column, ongoing, used to be there just out of curiosity. For example, in Socorro we used it to display a flashing message about which jobs are ongoing right now.

As of version 0.18, this ongoing column is actually used to NOT run apps again. Basically, when started, crontabber figures out which app to run next (assuming it's time to run it), and now the first thing it does is look up whether that app is already ongoing. If it is, the whole crontabber application exits with an error code of 3.
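
Conceptually (this is just an illustrative sketch, not crontabber's actual code), that check boils down to something like this:


import sys


def exit_if_ongoing(cursor, app_name):
    # Illustrative only: if the app is already marked as ongoing,
    # bail out of the whole program with exit code 3.
    cursor.execute(
        "SELECT ongoing FROM crontabber WHERE app_name = %s", (app_name,)
    )
    row = cursor.fetchone()
    if row and row[0] is not None:
        sys.exit(3)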

Sub-second locking

What might happen is that two separate servers with almost perfectly synchronized clocks have cron run crontabber at the "exact" same time. Or rather, only a few milliseconds apart. But the database is central, so two distinct PostgreSQL connections might try to send an UPDATE crontabber SET ongoing=now() WHERE app_name='some-app-name' at the very same time.

So how is this solved? The answer is row-level locking. The magic sauce is here. You make a SELECT, by app_name, with a suffix of FOR UPDATE NOWAIT. Imagine two distinct PostgreSQL connections sending this:


BEGIN;
SELECT ongoing FROM crontabber WHERE app_name = 'my-app-name'
FOR UPDATE NOWAIT;

-- do some other stuff in Python

UPDATE crontabber SET ongoing = now() WHERE app_name = 'my-app-name';
COMMIT;

One of them will succeed; the other will raise an error. Now all you need to do is catch that raised error and check that it's a row-level locking error and not some other general error. If it is, you simply accept it and exit the program early.

You can demonstrate this with a test.sql script in two terminals lined up side by side: start it in one, quickly switch and start it in the other, and one of the two UPDATEs will error.

Another way to demonstrate this is to use psycopg2 in a little script:


import threading
import psycopg2


def updater():
    connection = psycopg2.connect('dbname=crontabber_exampleapp')
    cursor = connection.cursor()
    cursor.execute("""
    SELECT ongoing FROM crontabber WHERE app_name = 'bar'
    FOR UPDATE NOWAIT
    """)
    cursor.execute("""
    UPDATE crontabber SET ongoing = now() WHERE app_name = 'bar'
    """)
    print("JOB CAN START!")
    connection.commit()


# Use threads to simulate starting two connections virtually 
# simultaneously.
threads = [
    threading.Thread(target=updater),
    threading.Thread(target=updater),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

The output of this is:

▶ python /tmp/test.py
JOB CAN START!
Exception in thread Thread-1:
Traceback (most recent call last):
...
OperationalError: could not obtain lock on row in relation "crontabber"

With threads, you never know exactly which one will work and which one will not. In this case it was Thread-1 that sent its SQL a couple of nanoseconds too late.
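
In a real job runner you'd catch that specific error and exit gracefully instead of letting the traceback bubble up. Here's a minimal sketch of what that could look like with psycopg2 (my own illustration, not crontabber's actual code):


import sys

import psycopg2
from psycopg2 import errorcodes


def try_to_claim_job(app_name):
    connection = psycopg2.connect('dbname=crontabber_exampleapp')
    cursor = connection.cursor()
    try:
        cursor.execute(
            "SELECT ongoing FROM crontabber WHERE app_name = %s "
            "FOR UPDATE NOWAIT",
            (app_name,)
        )
    except psycopg2.OperationalError as exception:
        # 55P03 means "lock_not_available", i.e. another connection is
        # holding the row lock. Anything else is a genuine error.
        if exception.pgcode == errorcodes.LOCK_NOT_AVAILABLE:
            print("Somebody else is already running this app. Exiting.")
            sys.exit(3)
        raise
    cursor.execute(
        "UPDATE crontabber SET ongoing = now() WHERE app_name = %s",
        (app_name,)
    )
    connection.commit()
    print("JOB CAN START!")


if __name__ == '__main__':
    try_to_claim_job('bar')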

In conclusion...

As of version 0.18 of crontabber, all locking is now dealt with inside crontabber. You still kick off crontabber from cron or crontab, but if cron does kick it off whilst it's still in the midst of running a job, it will simply exit with an error code of 2 or 3.

In other words, you can now have multiple servers running crontabber, all talking to a central PostgreSQL for their state, and not have to worry about jobs being started more than once. This will be super useful if your crontabber apps kick off stored procedures that would freak out if run more than once with the same parameters.

Optimization of QuerySet.get() with or without select_related

November 3, 2016
1 comment Python, Django, PostgreSQL

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imagine a model mapping that looks like this:


class Artist(models.Model):
    name = models.CharField(max_length=200)
    ...


class Song(models.Model):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:


>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is look up the Song object first, then automatically do a second lookup for the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear, right?

Now, if you know you're going to need that related field, you can ask Django to do the join up front instead. It looks like this:


>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is: which is fastest?

Well, there's only one way to find out, and that is to measure with some realistic data.

Here's the benchmarking code:


import random
import time
from collections import defaultdict

# (Song is the Django model defined above.)


def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass


def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass


def _stats(r):
    # returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot / len(r)
    sdsq = sum([(i - avg) ** 2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s) // 2], avg, (sdsq / (len(r) - 1 or 1)) ** .5


times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1 - t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgres.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

  • The server is a 4 GB DigitalOcean server that is being used but not super busy
  • The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
  • Django 1.10, Python 3.5, PostgreSQL 9.5.4
  • 1,588,258 "Song" objects, 102,496 "Artist" objects
  • Note that the range of IDs has gaps; lookups that find nothing are simply not counted

The Result

Function                    Median    Average   Std Dev
f1 (plain .get())           3.19ms    9.17ms    19.61ms
f2 (with select_related)    2.28ms    6.28ms    15.30ms

The Conclusion

Going by the median, using select_related is roughly 40% faster (2.28ms vs 3.19ms), and going by the average it's roughly 46% faster (6.28ms vs 9.17ms).

So, if you know you're going to need to do that lookup, put a .select_related(relation) before the .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups, and that's what the first method does too, just as two separate queries. The most likely reason the inner join approach is faster is simply that it's one round trip to the database instead of two.
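
If you want to check that on your own data, you can ask PostgreSQL for the plan of the joined query (table and column names here are the ones from the example above):


EXPLAIN ANALYZE
SELECT main_song.id, main_artist.name
FROM main_song
INNER JOIN main_artist ON (main_song.artist_id = main_artist.id)
WHERE main_song.id = 1234567;

On a primary-key lookup like this you'd typically see a nested loop over two index scans, i.e. the same two ID-by-index lookups, just done in a single round trip.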

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed, but this one is quite realistic because it's not trying to be optimized in either direction.

Select all relations in PostgreSQL

December 10, 2015
1 comment PostgreSQL

tl;dr Start psql with -E or --echo-hidden

I wanted to find EVERYTHING in a database that's related to a specific topic: tables, views, stored procedures, etc.
One way of doing that is to go into psql and type \d and/or \df and look through that list. But that's impractical if the list gets large, and I might want to get it out on stdout instead so I can grep and grep -v.

There are lots of Stack Overflow questions about how to select all tables with SQL, but I want it all. The solution is to start psql with -E or --echo-hidden. When you do that, it prints out the SQL it used to generate the output. You can then copy that and do whatever you want with it. For example:

peterbecom=# \d
********* QUERY **********
SELECT n.nspname as "Schema",
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
  pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','v','m','S','f','')
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;
**************************

With this I was able to come up with the following SQL selects to get all tables, views, sequences and functions.


SELECT
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table'
  WHEN 'v' THEN 'view'
  WHEN 'm' THEN 'materialized view'
  WHEN 'i' THEN 'index'
  WHEN 'S' THEN 'sequence'
  WHEN 's' THEN 'special'
  WHEN 'f' THEN 'foreign table' END as "Type"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','v','m','S','f','')
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid);


SELECT
  p.proname as "Name",
  'function'
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema';

A use case of this is that I can put those two SQL selects in a file and then grep:

$ psql mydatabase < everything.sql | grep -i crap
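
If you'd rather have a single query than a file with two selects, the two can also be glued together with a UNION ALL (this is just my own variation on the queries above):


SELECT
  c.relname as "Name",
  CASE c.relkind WHEN 'r' THEN 'table'
  WHEN 'v' THEN 'view'
  WHEN 'm' THEN 'materialized view'
  WHEN 'i' THEN 'index'
  WHEN 'S' THEN 'sequence'
  WHEN 's' THEN 'special'
  WHEN 'f' THEN 'foreign table' END as "Type"
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','v','m','S','f','')
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
      AND n.nspname !~ '^pg_toast'
      AND pg_catalog.pg_table_is_visible(c.oid)

UNION ALL

SELECT
  p.proname,
  'function'
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema';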

Weight of your PostgreSQL tables "lumped together"

October 31, 2015
0 comments PostgreSQL

UPDATE June 2020

The first SQL query below doesn't work in Postgres 12 and onwards. Use this instead:


SELECT relname AS "table_name",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND relkind = 'r'
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 10;

We have lots of tables that weigh a lot. Some of the tables are partitions so they're called "mytable_20150901" and "mytable_20151001" etc.

To find out how much each table weighs you can use this query:


select table_name, pg_relation_size(table_name), pg_size_pretty(pg_relation_size(table_name))
from information_schema.tables
where table_schema = 'public'
order by 2 desc limit 10;

It'll give you an output like this:

table_name        | pg_relation_size | pg_size_pretty
--------------------------+------------------+----------------
 raw_adi_logs             |      14724538368 | 14 GB
 raw_adi                  |      14691426304 | 14 GB
 tcbs                     |       7173865472 | 6842 MB
 exploitability_reports   |       6512738304 | 6211 MB
 reports_duplicates       |       4428742656 | 4224 MB
 addresses                |       4120412160 | 3930 MB
 missing_symbols_20150601 |       3264897024 | 3114 MB
 missing_symbols_20150608 |       3170762752 | 3024 MB
 missing_symbols_20150622 |       3039731712 | 2899 MB
 missing_symbols_20150615 |       2967281664 | 2830 MB
(10 rows)

But as you can see in this example, it might be interesting to know what the sum is of all the missing_symbols_* partitions.

Without further ado, here's how you do that:


select table_name, total, pg_size_pretty(total)
from (
  select trim(trailing '_0123456789' from table_name) as table_name, 
  sum(pg_relation_size(table_name)) as total
  from information_schema.tables
  where table_schema = 'public'
  group by 1
) as agg
order by 2 desc limit 10;

Then you'll get possibly very different results:

table_name        |    total     | pg_size_pretty
--------------------------+--------------+----------------
 reports_user_info        | 157111115776 | 146 GB
 reports_clean            | 106995695616 | 100 GB
 reports                  | 100983242752 | 94 GB
 missing_symbols          |  42231529472 | 39 GB
 raw_adi_logs             |  14724538368 | 14 GB
 raw_adi                  |  14691426304 | 14 GB
 extensions               |  12237242368 | 11 GB
 tcbs                     |   7173865472 | 6842 MB
 exploitability_reports   |   6512738304 | 6211 MB
 signature_summary_uptime |   6027468800 | 5748 MB
(10 rows)

You can read more about the trim() function here.
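
By the way, the reason trim(trailing '_0123456789' from table_name) works here is that it strips any trailing run of those characters, which is what collapses all the date-suffixed partition names into one. For example:


select trim(trailing '_0123456789' from 'missing_symbols_20150601');
-- returns 'missing_symbols'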

How to log ALL PostgreSQL SQL happening

July 20, 2015
1 comment MacOSX, PostgreSQL

When you're using PostgreSQL for local development it's sometimes important to get an insight into ALL SQL that happens on the PostgreSQL server. Especially if you're trying to debug all the transaction action too.

To do this on OSX where you have PostgreSQL installed with Homebrew you have to do the following:

1. Locate the right postgresql.conf file. On my computer this is in /opt/boxen/homebrew/var/postgres/ but that might vary depending on how you set up Homebrew. An easier way is to just start psql and ask there:

$ psql
psql (9.4.0)
Type "help" for help.

peterbe=# show config_file;
                   config_file
--------------------------------------------------
 /opt/boxen/homebrew/var/postgres/postgresql.conf
(1 row)

peterbe=#

Open that file in your favorite editor.

2. Look for a line that looks like this:

#log_statement = 'all'                   # none, ddl, mod, all

Uncomment that line.
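
As an aside, on PostgreSQL 9.4 and later you can also make this change from psql instead of editing the file, since log_statement only needs a config reload to take effect:


-- Needs a superuser; this writes the setting to postgresql.auto.conf
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- And to undo it when you're done debugging:
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();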

3. Now if you can't remember how to restart PostgreSQL on your system you can ask brew:

$ brew info postgresql

Towards the end you'll see some lines that look something like this:

$ launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

4. To locate where PostgreSQL dumps all logging, you have to look at how Homebrew set it up. You can do that by opening ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist. You'll find something like this:

<key>StandardErrorPath</key>
  <string>/opt/boxen/homebrew/var/postgres/server.log</string>

That's it. You can now see all SQL going on by running:

$ tail -f /opt/boxen/homebrew/var/postgres/server.log

Remember to reverse this config change when you're done debugging because that server.log file can quickly grow to an insane size since it's probably not log rotated.

Happy SQL debuggin'!

UPDATE Dec 2018

Another way to use Homebrew to restart PostgreSQL is with the command:

$ brew services restart postgresql

Crontabber

June 12, 2014
2 comments Python, Mozilla, PostgreSQL

Yesterday I had the good fortune to present Crontabber to The San Francisco Bay Area PostgreSQL Meetup Group, organized by my friend Josh Berkus.

To spare you having to watch the whole presentation I'm going to summarize some of it here.

My colleague Lars also has a blog post about Crontabber that goes into a bit more detail about the nitty-gritty of using PostgreSQL.

What is crontabber?

It's a tool for running cron jobs. It's written in Python, uses PostgreSQL for its state, and it's a tool we need for Socorro, which has lots and lots of stored procedures and cron jobs.

So it's basically a script that gets started by UNIX crontab and runs every 5 minutes. Internally it keeps an index of all the apps it needs to run, it manages dependencies between jobs, and it's self-healing, meaning that if something goes wrong at run-time it just retries again and again until it works. Amongst the many juicy features it offers on top of regular "cron wrappers" is something we call "backfilling".

Backfill jobs are jobs that happen on a periodic basis and get given a date. If all is working, this date is the same as now(), but if something goes wrong it remembers all the dates that did not work and re-attempts with those exact dates. That means you can guarantee that the job gets run for every possible date, even if it needs to catch up on past dates.

There's plenty of documentation on how to install and create jobs but it all starts with a simple pip install crontabber.

To get a feel for how you write crontabber "apps", check out the ones for Socorro or flick through the slides in the PDF.

Is it mature?

Yes! It all started in early 2012 as a part of the Socorro code base, and after some hard months of stabilizing and maturing we decided to extract it out of Socorro and into its own project on GitHub under the Mozilla Public License 2.0. Now it stands on its own legs, no longer has anything to do with Socorro, and can be used by anyone who has a lot of complicated cron jobs that need to run in Python with a PostgreSQL connection. In Socorro we use it primarily for executing stored procedures, but that's just one type of app. You can also make it call out to anything on the command line with a "@with_subprocess" decorator.

Is it finished?

No. It works really well for us but there's a decent list of features we want to add. The hope is that by open sourcing it we can get other organizations to adopt it and not only find bugs but also contribute code to add more juicy features.

One of the soon-to-come features we have in mind is to "internalize locking". At the moment you have to wrap it in a bash script that prevents it from being run concurrently. Crontabber is single-threaded and we don't have to worry about "dependent jobs" starting before "parent jobs", because the dependencies and their state are stored in the database. But you might need to worry about the same job (the one due next) being started concurrently. By internalizing the locking we can store, in the state database, that a particular job has been started, and thus not have to worry about starting the same job twice.

I really hope this project can grow and continue to support us in our needs.