This is perhaps insanely obvious, but it was a measurement I had to do and it might help you too if you use python-jsonschema a lot.
I have this project which has a migration script that needs to transfer about 1M records from one PostgreSQL database, transform them a bit, validate them, and store them in another PostgreSQL database. The validation step was done like this:
```python
from jsonschema import validate

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.safe_load(f)["schema"]

...

class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validate(build, SCHEMA)
```
That works fine when you have a slow trickle of these coming in, many seconds or minutes apart. But when you have to do about 1M of them, the speed overhead starts to really matter. Granted, in this context it's just a migration, which is hopefully only done once, but it helps if it doesn't take too long since that makes it easier to avoid any downtime.
What about python-fastjsonschema?
The name python-fastjsonschema just sounds very appealing, but I wasn't sure how mature it is or what the subtle differences are between it and the more established python-jsonschema, which I was already using.
There are two ways of using it, either...
```python
fastjsonschema.validate(schema, data)
```
...or...
```python
validator = fastjsonschema.compile(schema)
validator(data)
```
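To make that concrete, here's a minimal self-contained sketch of that compile-once pattern. The inline schema is a made-up toy, and, if I read the fastjsonschema docs right, the compiled callable raises fastjsonschema's own JsonSchemaException when the data is invalid:

```python
import fastjsonschema

# Toy schema, made up purely for illustration
schema = {"type": "object", "properties": {"id": {"type": "integer"}}}

# Compile once up front; the result is a plain callable
validate = fastjsonschema.compile(schema)

validate({"id": 1})  # passes
try:
    validate({"id": "not-an-integer"})
except fastjsonschema.JsonSchemaException as exc:
    print("invalid:", exc)
```

The whole point is that the expensive schema-to-code step happens once, outside the loop.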
That got me thinking: why don't I just do that with regular python-jsonschema!
All you need to do is crack open the validate function and you can now re-use one instance for multiple pieces of data:
```python
from jsonschema.validators import validator_for

klass = validator_for(schema)
klass.check_schema(schema)  # optional
instance = klass(schema)
instance.validate(data)
```
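As a bonus, that reusable instance gives you more than just validate. A quick sketch with a toy schema, using is_valid and iter_errors, which I believe are part of the same validator class API:

```python
from jsonschema.validators import validator_for

schema = {"type": "object", "required": ["name"]}  # toy schema for illustration
klass = validator_for(schema)
klass.check_schema(schema)
instance = klass(schema)

print(instance.is_valid({"name": "buildhub"}))  # True, and nothing is raised
for error in instance.iter_errors({}):  # iterate over all errors instead of raising
    print(error.message)
```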
I rewrote my project's code to this:
```python
from jsonschema.validators import validator_for

...

with open(os.path.join(settings.BASE_DIR, "schema.yaml")) as f:
    SCHEMA = yaml.safe_load(f)["schema"]

_validator_class = validator_for(SCHEMA)
_validator_class.check_schema(SCHEMA)
validator = _validator_class(SCHEMA)

...

class Build(models.Model):
    ...

    @classmethod
    def validate_build(cls, build):
        validator.validate(build)
```
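Call sites don't need to change at all. For illustration, a hypothetical caller (build_dict is just a stand-in name for one transformed record) still does:

```python
from buildhub.main.models import Build
from jsonschema import ValidationError

try:
    Build.validate_build(build_dict)  # build_dict: hypothetical record dict
except ValidationError as exc:
    print("invalid build record:", exc.message)
```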
How do they compare, performance-wise?
Let this simple benchmark code speak for itself:
```python
from buildhub.main.models import Build, SCHEMA

import fastjsonschema
from jsonschema import validate, ValidationError
from jsonschema.validators import validator_for


def f1(qs):
    for build in qs:
        validate(build.build, SCHEMA)


def f2(qs):
    validator = validator_for(SCHEMA)
    for build in qs:
        validate(build.build, SCHEMA, cls=validator)


def f3(qs):
    cls = validator_for(SCHEMA)
    cls.check_schema(SCHEMA)
    instance = cls(SCHEMA)
    for build in qs:
        instance.validate(build.build)


def f4(qs):
    for build in qs:
        fastjsonschema.validate(SCHEMA, build.build)


def f5(qs):
    validator = fastjsonschema.compile(SCHEMA)
    for build in qs:
        validator(build.build)


# Reporting
import time
import statistics
import random

functions = f1, f2, f3, f4, f5
times = {f.__name__: [] for f in functions}

for _ in range(3):
    qs = list(Build.objects.all().order_by("?")[:1000])
    for func in functions:
        t0 = time.time()
        func(qs)
        t1 = time.time()
        times[func.__name__].append((t1 - t0) * 1000)


def f(ms):
    return f"{ms:.1f}ms"


for name, numbers in times.items():
    print("FUNCTION:", name, "Used", len(numbers), "times")
    print("\tBEST  ", f(min(numbers)))
    print("\tMEDIAN", f(statistics.median(numbers)))
    print("\tMEAN  ", f(statistics.mean(numbers)))
    print("\tSTDEV ", f(statistics.stdev(numbers)))
```
Basically, for each of the alternative implementations, validate 1,000 JSON blobs (technically Python dicts), each around 1KB in size, and repeat the whole thing 3 times.
The results:
```
FUNCTION: f1 Used 3 times
    BEST   1247.9ms
    MEDIAN 1309.0ms
    MEAN   1330.0ms
    STDEV  94.5ms
FUNCTION: f2 Used 3 times
    BEST   1266.3ms
    MEDIAN 1267.5ms
    MEAN   1301.1ms
    STDEV  59.2ms
FUNCTION: f3 Used 3 times
    BEST   125.5ms
    MEDIAN 131.1ms
    MEAN   133.9ms
    STDEV  10.1ms
FUNCTION: f4 Used 3 times
    BEST   2032.3ms
    MEDIAN 2033.4ms
    MEAN   2143.9ms
    STDEV  192.3ms
FUNCTION: f5 Used 3 times
    BEST   16.7ms
    MEDIAN 17.1ms
    MEAN   21.0ms
    STDEV  7.1ms
```
Basically, if you use python-jsonschema and create a reusable instance, it's 10 times faster than the "default way". And if you do the same but with python-fastjsonschema, it's 100 times faster.
By the way, version f5 validated 1,000 1KB records in 16.7ms. That works out to roughly 17µs per record. Insanely fast!
Comments
Hi. Author of the Fast JSON Schema here. :-)
I wrote about the details of the project here: https://blog.horejsek.com/fastjsonschema/ It's ready for production code and offers full support of JSON Schema Draft 04, 06 and 07.
The reason why f4 is slow is that it creates Python code on the fly in every cycle. Using validate directly is really only for when you are lazy and it's one-time usage. For high performance you should always use compile.
BTW you can gain a little bit more by generating the Python code to a file and importing that instead. Maybe you could try that as an f6; it should be even slightly better. :-) You can generate a validation module for your schema with the following command: `echo "{'type': 'string'}" | python3 -m fastjsonschema > your_file.py` (or use fastjsonschema.compile_to_code on your own: https://horejsek.github.io/python-fastjsonschema/#fastjsonschema.compile_to_code)
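For illustration, a rough sketch of what that f6 could look like, assuming the generated module exposes a validate function as the docs describe (the file and module names here are just examples):

```python
import fastjsonschema

# One-time step: write the generated validator module to disk
code = fastjsonschema.compile_to_code(SCHEMA)
with open("schema_validator.py", "w") as f:
    f.write(code)

def f6(qs):
    # Import the pre-generated module; nothing is compiled at runtime
    from schema_validator import validate
    for build in qs:
        validate(build.build)
```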
Thanks for sharing!
What happened in my case was that...
1) I need faster JSON schema validation
2) Let's check out Fast JSON Schema
3) Huh! How about that! You create the instance once and reuse it. Why don't I just do that with the existing stack?
4) Reusing existing stack but doing the create-instance-once pattern.
5) Totally good enough for now.
I hope my blog post, plus your comment here, sheds some light on the fact that there is an alternative to regular python-jsonschema that is production grade and distinctly faster.
Hi! jsonschema author here :)
One minor point that worries me here -- I'm curious as to why you had to "crack open the validate function" to find the validator API -- if you have suggestions on how to improve the documentation they'd be very welcome. That API is very much not internal, and I'd have thought that the docs at https://python-jsonschema.readthedocs.io/en/stable/validate/ would have led you right to it, so if you have a suggestion on what you'd have needed to see there I'd love to hear it.
And as a "philosophical" rule, `jsonschema` does not prioritize its performance on CPython. If someone notices slowness on CPython and sends a patch that doesn't slow things down elsewhere I've been happy to merge it, but I personally always prioritize performance on PyPy (and it's the only thing I look at or compare). So I'm keen to re-run these there and see what the results look like.
Also -- would you mind confirming what the license is of your benchmark? I'm considering adding it to `jsonschema`'s benchmark suite if you tell me it's something permissive :)
Hi,
The code on https://github.com/Julian/jsonschema (the README) only shows the `jsonschema.validate` function, which forces the creation of a schema class instance every single time. There is no mention in the README of the trick of accessing the class, instantiating it once, and calling its `validate` function repeatedly.
Also, the docs on https://python-jsonschema.readthedocs.io/en/stable/validate/ demonstrate the same convenience function (which does the class instantiation on every single entry, even though the schema hasn't changed).
I think we could add a piece somewhere about the fact that "If you have multiple entries all with the same schema, consider this pattern..."
Regarding the license for the benchmark: you have my written consent, right here right now, to do whatever you want with it. It's not licensed, so you don't even have to attribute.
Keep up the good work!
Thanks (on both!)
Let me know if https://github.com/Julian/jsonschema/commit/2e082b58e44356a4acd7832f46cbf91423373380 seems like what would have helped.
It helps, but I think it would still be a good idea to mention it in that first little code snippet in the README.
The README is a README, not really documentation -- to be honest I'd remove all the code from there entirely if it wasn't that the README is what's used for PyPI and is what you see when you load the repo, so it's *something* for someone to see. But beyond "show me what this library does in one sentence" I'd really expect someone to read the documentation.
But will think about it.
You're not wrong, it's just that reality is like that. Whatever code snippets one sees in the README are usually all your eyes have time to scan.
Granted, if the project is your main at-work project and quality is super important, then it might be a different story. But so often it's just one of many projects, and the thing you're using a library for might not be critical, so you're looking for a quick fix, and that's what the code snippets in the README are for.
If you think there are dangers with skimming a snippet like that, I would remove it and replace it with a link into the "meat of the documentation".
Great, sounds good!