No, this is not about the new JSON Type added in Postgres 9.2. This is about how you can get a record set from a Postgres database into a JSON string the best way possible using Python.
Here's the traditional way:
>>> import json
>>> import psycopg2
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor()
>>> cur.execute("""
... SELECT
...     id, oid, root, approved, name
... FROM blogcomments
... LIMIT 10
... """)
>>> columns = (
...     'id', 'oid', 'root', 'approved', 'name'
... )
>>> results = []
>>> for row in cur.fetchall():
...     results.append(dict(zip(columns, row)))
...
>>> print json.dumps(results, indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
  ...
This is plain and nice but it's kinda annoying that you have to write down the columns you're selecting twice.
Also, it's annoying that you have to convert the results of fetchall()
into a list of dicts in an extra loop.
So, there's a trick to the rescue! You can use the cursor_factory
parameter. See below:
>>> import json
>>> import psycopg2
>>> from psycopg2.extras import RealDictCursor
>>>
>>> conn = psycopg2.connect('dbname=peterbecom')
>>> cur = conn.cursor(cursor_factory=RealDictCursor)
>>> cur.execute("""
... SELECT
...     id, oid, root, approved, name
... FROM blogcomments
... LIMIT 10
... """)
>>>
>>> print json.dumps(cur.fetchall(), indent=2)
[
  {
    "oid": "comment-20030707-161847",
    "root": true,
    "id": 5662,
    "name": "Peter",
    "approved": true
  },
  {
    "oid": "comment-20040219-r4cf",
    "root": true,
    "id": 5663,
    "name": "silconscave",
    "approved": true
  },
  {
    "oid": "c091011r86x",
    "root": true,
    "id": 5664,
    "name": "Rachel Jay",
    "approved": true
  },
  ...
Isn't that much nicer? It's shorter and only lists the columns once.
But is it much faster? Sadly, no. Not by much. I ran various benchmarks comparing the different approaches and basically concluded that there's no significant difference. The latter one, using RealDictCursor, is around 5% faster. But I suspect most of the time in the benchmark is spent on things (the I/O) that don't differ between the versions.
Anyway. It's a keeper. I think it just looks nicer.
Comments
It isn't really necessary to use a RealDictCursor. Any DB-API cursor ought to return the column name, in cursor.description[0] after an execute() (and psycopg complies with this). See http://www.python.org/dev/peps/pep-0249/#description.
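For reference, here's a minimal sketch of that cursor.description approach, reusing the same connection and query as above (the column name is the first field of each entry in cur.description):

>>> cur = conn.cursor()
>>> cur.execute("""
... SELECT
...     id, oid, root, approved, name
... FROM blogcomments
... LIMIT 10
... """)
>>> # Pull the column names out of the cursor's description
>>> columns = [col[0] for col in cur.description]
>>> results = [dict(zip(columns, row)) for row in cur.fetchall()]
>>> print json.dumps(results, indent=2)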
Tried that. It works. It solves the problem of having to manually write down all the columns. However, it doesn't feel as neat as the RealDictCursor solution?
Not sure why you phrased the last sentence interrogatively. "Neat" depends on what you're used to. I tend to use a DictConnection as connection_factory on the psycopg2.connect, which makes all my cursors DictCursors. Haven't tested it, but I believe gives that the same capability as RealDictCursors (as far as json.dumps is concerned), but with added flexibility.
Using a DictConnection implies a DictCursor. But I really want a RealDictCursor otherwise, when you run `json.dumps(cur.fetchall())` it doesn't work. I think when json.dumps() interrogates a DictRow object it, for some reason, serializes it to a list.
That's why RealDictCursor is more impressive to me.
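For what it's worth, here's a rough sketch of that connection_factory variant, but using RealDictConnection so that json.dumps() still gets real dicts rather than DictRow objects (untested, just to illustrate the idea):

>>> import json
>>> import psycopg2
>>> from psycopg2.extras import RealDictConnection
>>>
>>> conn = psycopg2.connect('dbname=peterbecom', connection_factory=RealDictConnection)
>>> cur = conn.cursor()  # every cursor from this connection is a RealDictCursor
>>> cur.execute("SELECT id, oid, root, approved, name FROM blogcomments LIMIT 10")
>>> print json.dumps(cur.fetchall(), indent=2)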
Wouldn't expect it to be faster, really. There's a little bit of CPU saving by not doing the redundant copy, but as you say, most of the activity is I/O. I came to realise years ago that with database applications, there's very little point in trying to optimise CPU use in your own code, because it's typically dwarfed by what goes on in the database driver, never mind the actual I/O element.
That said, the change is more than worth it, just for how it simplifies the code.
Right, the biggest optimizations will come from limiting I/O, so doing as much as possible on the other end of that pipe, in the DB engine itself: nothing worse than effectively doing a join on the wrong end of the I/O, not to mention sorts.
As to the post's topic, there is a generic way to take advantage of the native JSON support in both PostgreSQL and psycopg2: try something like this:
cur.execute('select array_agg(row_to_json(blogcomments)) from blogcomments limit 10')
res = cur.fetchall()
print json.dumps(res[0][0], indent=2)
Aside from the somewhat obscure reasons for the two extra list wrappers, this keeps your column names/JSON keys in one place: the DB table definition.
MySQLdb.cursors.DictCursor works pretty much the same way if you're using MySQL instead of PostgreSQL. See: http://mysql-python.sourceforge.net/MySQLdb-1.2.2/public/MySQLdb.cursors.DictCursor-class.html
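A rough sketch of that MySQLdb equivalent (the connection parameters here are made up for illustration):

>>> import json
>>> import MySQLdb
>>> import MySQLdb.cursors
>>>
>>> conn = MySQLdb.connect(db='peterbecom', cursorclass=MySQLdb.cursors.DictCursor)
>>> cur = conn.cursor()  # rows come back as dicts keyed by column name
>>> cur.execute("SELECT id, oid, root, approved, name FROM blogcomments LIMIT 10")
>>> print json.dumps(cur.fetchall(), indent=2)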
Excellent method, thanks for sharing this clean way to work around having to add the column names.
Is there a similar way to do it in reverse, from JSON to a predefined Postgres table?
Thanks, Jorge
When I ran your sample on my db table I got:
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: datetime.date(2014, 3, 24) is not JSON serializable
You need to take care of things like datetime.date.
E.g. http://stackoverflow.com/a/27058505
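A minimal sketch of that kind of custom serializer (the handler name is just for illustration):

>>> import json
>>> import datetime
>>>
>>> def json_default(obj):
...     # Turn date/datetime values into ISO 8601 strings
...     if isinstance(obj, (datetime.date, datetime.datetime)):
...         return obj.isoformat()
...     raise TypeError(repr(obj) + " is not JSON serializable")
...
>>> print json.dumps(cur.fetchall(), indent=2, default=json_default)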
Awesome - thanks!
How do you do this if there's a date column in Postgres? It gives this error: TypeError: Object of type time is not JSON serializable
You need a special serializer for the JSON-to-Python or Python-to-JSON conversion.
See https://stackoverflow.com/questions/11875770/how-can-i-overcome-datetime-datetime-not-json-serializable for tips.