How to JSON schema validate 10x (or 100x) faster in Python

04 November 2018   9 comments   Python

This is perhaps insanely obvious, but it was a measurement I had to do, and it might help you too if you use python-jsonschema a lot.

I have this project which has a migration script that needs to transfer about 1M records from one PostgreSQL database, transform them a bit, validate them, and store them in another PostgreSQL database. The validation step was done like this:

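import os

import yaml
from jsonschema import validate

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    # safe_load parses plain YAML without constructing arbitrary Python objects
    SCHEMA = yaml.safe_load(f)["schema"]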

...


class Build(models.Model):

    ...

    @classmethod
    def validate_build(cls, build):
        validate(build, SCHEMA)

That works fine when you have a slow trickle of these coming in with many seconds or minutes apart. But when you have to do about 1M of them, the speed overhead starts to really matter. Granted, in this context, it's just a migration which is hopefully only done once but it helps that it doesn't take too long since it makes it easier to not have any downtime.

What about python-fastjsonschema?

The name python-fastjsonschema just sounds very appealing but I'm just not sure how mature it is or what the subtle differences are between that and the more established python-jsonschema which I was already using.

There are two ways to use it. Either...

fastjsonschema.validate(schema, data)

...or...

validator = fastjsonschema.compile(schema)
validator(data)

That got me thinking: why don't I just do that with regular python-jsonschema?
All you need to do is crack open the validate function and you can reuse one instance for multiple pieces of data:

from jsonschema.validators import validator_for


klass = validator_for(schema)
klass.check_schema(schema)  # optional
instance = klass(schema)
instance.validate(data)
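If you want to try the pattern in isolation, the following is a small self-contained sketch. The schema and the records are invented for the example (they are not my real data), and it assumes `jsonschema` is installed.

```python
from jsonschema import ValidationError
from jsonschema.validators import validator_for

# Hypothetical toy schema, purely for illustration.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "size": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}

# Pick the validator class that matches the schema, check the schema
# itself once, then build a single instance to reuse for every record.
klass = validator_for(schema)
klass.check_schema(schema)
instance = klass(schema)


def is_valid(data):
    """Return True if data passes the schema, reusing the one instance."""
    try:
        instance.validate(data)
    except ValidationError:
        return False
    return True


records = [{"name": "a", "size": 1}, {"size": 2}, {"name": "c", "size": -1}]
results = [is_valid(record) for record in records]
print(results)
```

The point is simply that `validator_for`, `check_schema` and the instantiation happen once, outside the loop over records.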

I rewrote my project's code to this:

from jsonschema.validators import validator_for

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.safe_load(f)["schema"]
# Build the validator once at import time and reuse it for every record.
_validator_class = validator_for(SCHEMA)
_validator_class.check_schema(SCHEMA)
validator = _validator_class(SCHEMA)

...


class Build(models.Model):

    ...

    @classmethod
    def validate_build(cls, build):
        validator.validate(build)

How do they compare, performance-wise?

Let this simple benchmark code speak for itself:

from buildhub.main.models import Build, SCHEMA

import fastjsonschema
from jsonschema import validate
from jsonschema.validators import validator_for


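def f1(qs):
    # Baseline: the convenience function, called once per record.
    for build in qs:
        validate(build.build, SCHEMA)


def f2(qs):
    # Only skips the class lookup; validate() still checks the schema
    # and builds a new validator instance on every call.
    validator_class = validator_for(SCHEMA)
    for build in qs:
        validate(build.build, SCHEMA, cls=validator_class)


def f3(qs):
    # Check the schema and instantiate once, then reuse the instance.
    cls = validator_for(SCHEMA)
    cls.check_schema(SCHEMA)
    instance = cls(SCHEMA)
    for build in qs:
        instance.validate(build.build)


def f4(qs):
    # fastjsonschema's one-off function, called once per record.
    for build in qs:
        fastjsonschema.validate(SCHEMA, build.build)


def f5(qs):
    # fastjsonschema compiled once, then reused.
    validator = fastjsonschema.compile(SCHEMA)
    for build in qs:
        validator(build.build)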


# Reporting
import time
import statistics

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}


for _ in range(3):
    qs = list(Build.objects.all().order_by("?")[:1000])
    for func in functions:
        t0 = time.time()
        func(qs)
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)


def f(ms):
    return f"{ms:.1f}ms"


for name, numbers in times.items():
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST  ", f(min(numbers)))
    print("\tMEDIAN", f(statistics.median(numbers)))
    print("\tMEAN  ", f(statistics.mean(numbers)))
    print("\tSTDEV ", f(statistics.stdev(numbers)))

Basically, each of the alternative implementations is run 3 times, each time validating 1,000 JSON blobs (technically Python dicts) of around 1KB each.

The results:

FUNCTION: f1 Used 3 times
    BEST   1247.9ms
    MEDIAN 1309.0ms
    MEAN   1330.0ms
    STDEV  94.5ms
FUNCTION: f2 Used 3 times
    BEST   1266.3ms
    MEDIAN 1267.5ms
    MEAN   1301.1ms
    STDEV  59.2ms
FUNCTION: f3 Used 3 times
    BEST   125.5ms
    MEDIAN 131.1ms
    MEAN   133.9ms
    STDEV  10.1ms
FUNCTION: f4 Used 3 times
    BEST   2032.3ms
    MEDIAN 2033.4ms
    MEAN   2143.9ms
    STDEV  192.3ms
FUNCTION: f5 Used 3 times
    BEST   16.7ms
    MEDIAN 17.1ms
    MEAN   21.0ms
    STDEV  7.1ms

Basically, if you use python-jsonschema and create a reusable instance, it's about 10 times faster than the "default way". And if you do the same with python-fastjsonschema, it's around 75 times faster in this run (f5 versus f1), so approaching 100 times.
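The speedup factors can be worked out directly from the best-of-3 numbers in the results. The last column is a naive linear extrapolation to the 1M records of the migration, which is an assumption on my part (it ignores database I/O and everything else the script does).

```python
# Best-of-3 timings in milliseconds per 1,000 records, copied from the results.
best_ms = {"f1": 1247.9, "f2": 1266.3, "f3": 125.5, "f4": 2032.3, "f5": 16.7}

baseline = best_ms["f1"]
speedups = {name: baseline / ms for name, ms in best_ms.items()}

# Naive extrapolation: 1M records is 1,000 batches of 1,000 records.
minutes_for_1m = {name: ms * 1000 / 1000 / 60 for name, ms in best_ms.items()}

for name in best_ms:
    print(
        f"{name}: {speedups[name]:.1f}x vs f1, "
        f"~{minutes_for_1m[name]:.1f} min of validation for 1M records"
    )
```

With these numbers f3 comes out at roughly 9.9x and f5 at roughly 74.7x relative to f1, while f4 is slower than the baseline (about 0.6x).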

By the way, in version f5 it validated 1,000 1KB records in 16.7ms. That's insanely fast!
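As for why f4 is the slowest of the lot: according to the library's author (see the comments), `fastjsonschema.validate` generates the validation code again on every call. The following toy sketch, using only the standard library, illustrates that general effect. It is not how fastjsonschema is implemented; `make_checker` is a hypothetical helper invented for this example.

```python
import timeit


def make_checker(max_len):
    """Generate source for a tiny check function and exec it (toy codegen)."""
    source = (
        "def check(value):\n"
        f"    return isinstance(value, str) and len(value) <= {max_len}\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["check"]


values = ["abc", "x" * 50, 42] * 100


def generate_every_time():
    # Analogous to f4: rebuild the checker for every single value.
    return [make_checker(10)(value) for value in values]


def generate_once():
    # Analogous to f5: build the checker once and reuse it.
    check = make_checker(10)
    return [check(value) for value in values]


# Both approaches give identical answers; only the cost differs.
assert generate_every_time() == generate_once()

slow = timeit.timeit(generate_every_time, number=20)
fast = timeit.timeit(generate_once, number=20)
print(f"every time: {slow * 1000:.1f}ms, once: {fast * 1000:.1f}ms")
```

The exact timings will vary by machine, so none are quoted here, but generating and compiling source per value should dominate the per-call cost by a wide margin.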

Comments

Michal Hořejšek

Hi. Author of the Fast JSON Schema here. :-)

I wrote about details of the project here: https://blog.horejsek.com/fastjsonschema/ It's ready for production code and offers full support of JSON Schema Draft 04, 06 and 07.

The reason why f4 is slow is that it creates Python code on the fly in every cycle. Using validate directly is really only for when you are lazy and it's a one-time use. To have high performance you should always use compile.

BTW you can also gain a little bit by generating the Python code to a file and importing that instead. Maybe you could try to do f6. It should be even slightly better. :-) You can generate a validation module for your schema with the following command: echo "{'type': 'string'}" | python3 -m fastjsonschema > your_file.py (or use fastjsonschema.compile_to_code on your own: https://horejsek.github.io/python-fastjsonschema/#fastjsonschema.compile_to_code)

Peter Bengtsson

Thanks for sharing!

What happened in my case was that...
1) I need faster JSON schema validation
2) Let's check out Fast JSON Schema
3) Huh! How about that! You create the instance once and reuse it. Why don't I just do that with the existing stack?
4) Reusing existing stack but doing the create-instance-once pattern.
5) Totally good enough for now.

I hope my blog post shines some light - plus your comment here - about the fact that there is an alternative to regular python-jsonschema that is production grade and distinctly faster.

Julian Berman

Hi! jsonschema author here :)

One minor point that worries me here -- I'm curious as to why you had to "crack open the validate function" to find the validator API -- if you have suggestions on how to improve the documentation they'd be very welcome. That API is very much not internal, and I'd have thought that the docs at https://python-jsonschema.readthedocs.io/en/stable/validate/ would have led you right to it, so if you have a suggestion on what you'd have needed to see there I'd love to hear it.

And as a "philosophical" rule, `jsonschema` does not prioritize its performance on CPython. If someone notices slowness on CPython and sends a patch that doesn't slow things down elsewhere I've been happy to merge it, but I personally always prioritize performance on PyPy (and it's the only thing I look at or compare). So I'm keen to re-run these there and see what the results look like.

Also -- would you mind confirming what the license is of your benchmark? I'm considering adding it to `jsonschema`'s benchmark suite if you tell me it's something permissive :)

Peter Bengtsson

Hi,
The code on https://github.com/Julian/jsonschema (the README) only shows the `jsonschema.validate` function which forces the creation of a schema class instance every single time. There is no mention on the README about the trick of accessing the class, instantiating it once, and calling its `validate` function repeatedly.

Also, the docs on https://python-jsonschema.readthedocs.io/en/stable/validate/ demonstrate the same convenience function, which does the class instantiation on every single entry even though the schema hasn't changed.

I think we could add a piece somewhere about the fact that "If you have multiple entries all with the same schema, consider this pattern..."

Regarding license for the benchmark, you have my written consent right here right now to do whatever you want with it. It's not licensed so you don't even have to attribute.

Keep up the good work!

Julian Berman

Thanks (on both!)

Let me know if https://github.com/Julian/jsonschema/commit/2e082b58e44356a4acd7832f46cbf91423373380 seems like what would have helped.

Peter Bengtsson

It helps but I think it would still be a good idea to mention it in that first little code snippet in the README

Julian Berman

The README is a README, not really documentation -- to be honest I'd remove all the code from there entirely if it wasn't that the README is what's used for PyPI and is what you see when you load the repo, so it's *something* for someone to see. But beyond "show me what this library does in one sentence" I'd really expect someone to read the documentation.

But will think about it.

Peter Bengtsson

You're not wrong, it's just that reality is like that. Whatever code snippets one sees in the README are usually all your eyes have time to scan.

Granted, if the project is your main at-work project and quality is super important then it might be a different story. So often, it's just one of many projects and the thing you're using a library for might not be a critical thing so you're looking for a quick fix and that's what the code snippets in the README are for.

If you think there are dangers with skimming a snippet like that, I would remove it and replace it with a link into the "meat of the documentation".

Anonymous

Great sounds good!
