JSON and the Arguments

Now that we’re getting better support for storing JSON (JavaScript Object Notation) in databases, what is it really giving us? Whatever else, the technique of “stuffing data into a JSON blob and adding an index” doesn’t always absolve us from the task of carefully designing and normalizing our database schema.

There are many advantages, I’ll admit, to storing the JSON in the database, since we get the benefit of transactional support, and our JSON data is then point-in-time consistent with the rest of the database. Storing the data as JSON is also, for many developers, a convenient way to “extend” the schema in cases where the attributes that need to be stored can change frequently.

So where is the danger? It comes when someone has the idea that we can select or order rows based on values stored within the JSON BLOB. No! Don’t do it! By putting the JSON string in a column, you have already decided that, from the perspective of the database, it is an atomic value. Dan Robinson demonstrates, in a recent PostgreSQL post, a 2000 times performance degradation when searching on atomic column values in the WHERE clause of a simple 2-table join query, versus querying rows based on attribute values found inside a JSON blob. The root cause turned out to be the fact that PostgreSQL cannot maintain statistics on fields within JSONB columns and so has no way of knowing the selectivity of a predicate or how many rows will be returned. Instead, it uses a built-in estimate that 0.1% of the rows in the table will be retuned, which it will round up to “1”. In such cases, the optimizer will generally fulfil the join condition using a nested loops operation; highly efficient if there really are only a handful of rows returned, but not if there are in fact hundreds or thousands rows, and a performance killer if there is no appropriate index on the “outer” join table.

Now that we have a native JSONB data type, much stronger support for storing and querying JSON, the ability to index the JSON columns, and so on, there are ways of tackling the problem. If you need to select on an attribute buried in a JSON blob, you need to expose it in a persisted and calculated column using a function that is able to extract it. Then you must index it. If anything I’ve just written sounds at all scary, then just remember that if you want to store a JSON blob in the database, and the whole blob is all you ever need to return, then databases such as PostgreSQL and Azure SQL Database can support this pretty well. However, if you’re running queries against JSON data, returning rows based on attribute values found inside a blob, then expect performance issues.