PostgreSQL anti-patterns: Unnecessary json/hstore dynamic columns

June 20, 2020

PostgreSQL has json support – but you shouldn’t use it for the great majority of what you’re doing. This goes for hstore too, and the new jsonb type. These types are useful tools where they’re needed, but should not be your first choice when modelling your data in PostgreSQL, as it’ll make querying and manipulating it harder.

Some readers will be familiar with the (anti)-pattern. EAV has been a necessary evil in database schemas for a long time. It’s something we all advise against using and look down on, then resort to when reality and business requirements mean not everything can be fit into rigid fields in properly modelled relations. For those who don’t know it, it’s a schema design where you turn a relational database into a poor-man’s key/value store using a table with object_id (“entity”), key (“attribute”) and value columns. Querying it usually involves lots and lots of self-joins.

json is the new EAV – a great tool when you need it, but not something you should use as a first choice.

(Most of this also applies to PostgreSQL arrays, which are great tools for building queries, but not usually an appropriate way to store authorative data).

JSON blob tables

Since the release of PostgreSQL 9.3, and even more since jsonb in 9.4, I’m seeing more and more Stack Overflow questions where people ask how to do joins on json objects, how to query for a nested key inside any array element, etc. All reasonable things to do, but when asked for schema and example data they’ll often post something like this:

CREATE TABLE people(
    id serial primary key,
    data jsonb not null
);

INSERT INTO people(data) VALUES ($$
{
    "name": "Bob",
    "addresses": [
        {
            "street": "Centre",
            "streetnumber": 24,
            "town": "Thornlie",
            "state": "WesternAustralia",
            "country": "Australia"
        },
        {
            "street": "Example",
            "streetnumber": "4/311-313",
            "town": "Auckland",
            "country": "NewZealand"
        }
    ],
    "phonenumbers": [
        {
            "type": "mobile",
            "number": "12345678"
        }
    ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "Fred",
  "phonenumbers": [
    { "type": "mobile", "number": "12345678" }
  ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "John Doe"
}
$$);

and ask “how do I find people who have the same phone number?”.

Experienced relational database users will already be wincing… but remember, not everybody is used to inverting their thinking to follow foreign keys backwards from child to parent and decomposing composite entities into normal form. When you think about it, relational databases are weird.

Most of PostgreSQL’s power comes from that relational model, though. When you store your data as json blobs you deny the query planner the ability to make sensible decisions based on table and column statistics, you lose most of the indexing features and scan types, and are generally restricted to quite primitive operations. Doing anything interesting tends to involve lots of self-joins and filters.

Baroque queries

For example, the request to find people with matching phone numbers could be turned into a query like:

select 
  p1.id AS person1,
  p2.id AS person2,
  p1.data ->> 'name' AS "p1 name",
  p2.data ->> 'name' AS "p2 name",
  pns1 ->> 'type' AS "type", 
  pns1 ->> 'number' AS "number" 
from people p1 
  inner join people p2 
    on (p1.id > p2.id)
  cross join lateral jsonb_array_elements(p1.data -> 'phonenumbers') pns1
  inner join lateral jsonb_array_elements(p2.data -> 'phonenumbers') pns2 
    on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

Isn’t that just a thing of readable beauty?

There’s nothing particularly wrong with the json support functions and operators, it’s just awkward taking nested object-like data and querying it using relational joins and predicates.

At least PostgreSQL supports LATERAL; without that, this would be a nightmare of a query to write.

As it happens an equivalent query on a relational schema for this particular question isn’t that much nicer:

SELECT
  p1.id AS "person1",
  p2.id AS "person2",
  p1.name AS "p1 name",
  p2.name AS "p2 name",
  pns1.phoneno AS "p1 phone",
  pns2.phoneno AS "p2 phone"
FROM
  person p1
  INNER JOIN person p2 ON (p1.id < p2.id)
  INNER JOIN phone pns1 ON (p1.id = pns1.person_id)
  INNER JOIN phone pns2 ON (p2.id = pns2.person_id)
WHERE
  pns1."type" = pns2."type"
  AND pns1.phoneno = pns2.phoneno;

... but it's likely to go a *lot* faster with an index on (phone."type", phone.phoneno) than the json based query would, and read a lot less data off disk in the process.

No fixed data typing

The json-based query given above is also buggy, because jsonb comparisons are sensitive to data type so the jsonb values "12345678" and 12345678 are unequal:

regress=> SELECT '12345678'::jsonb, '"12345678"'::jsonb, '12345678'::jsonb = '"12345678"'::jsonb AS "isequal";
  jsonb   |   jsonb    | isequal 
----------+------------+---------
 12345678 | "12345678" | f

so this:

insert into people (data) values ('{"phonenumbers": [{"type":"mobile","number":12345678}]}');

won't be identified as a duplicate even though it should be.

Note that this is arguably a PostgreSQL limitation, since a JavaScript intepreter's loose typing means they compare as equal:

> 12345678 == "12345678"
> true

To get the numbers to match using a PostgreSQL query, we have to change our query so that this:

on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

uses the ->> json-value-as-text operator instead:

on (pns1 -> 'type' = pns2 -> 'type' AND pns1 ->> 'number' = pns2 ->> 'number');

since the text form of the numbers is the same.

No constraints

Of course, someone might insert:

insert into people (data) values ('{"phonenumbers": [{"type":"Mobile","number":"1234 5678"}]}');

... which won't match because of the capital M and the space in the number.

Because everything is in a json blob can't easily add CHECK constraints on the value, use ENUM types or DOMAINs, create trigger-maintained shadow normalized columns, or any of the usual approaches used to handle data normalisation like this. It'd be necessary to write a json validator/transform trigger in a procedural language like pl/v8 or do it all in the application.

So when should json be used?

From the above you might be thinking that I'm against using json types in the database. That's not the case at all. They're often better than EAV when you have dynamic-column data that just won't fit into a traditional relational model. Or, as I've seen people try in the past, co-ordinate with an external MongoDB for the json storage.

For example, sometimes application users want to be able to add arbitrary attributes to an entity. It's a business requirement. The client don't like it when you say that the database doesn't do that so they can't have notes in their application, and they're muttering things about "just doing it in the application" or "we didn't have these problems with MongoDB".

So you might model the typical stuff relationally (albeit with indiscriminate use of surrogate keys), but add a json field in person just for those extras so the app can fetch them when displaying a person.

How to decide when to use json

Use json if your data won't fit in the database using a normal relational modelling. If you're choosing between using EAV, serializing a Java/Ruby/Python object into a bytea field, or storing a key to look up an external structured object somewhere else ... that's when you should be reaching for json fields.

At this point it's probably worth replacing hstore use with jsonb in all new applications.

Note that plain json fields are also useful - if you're not going to be indexing the json and querying within it, they're usually more compact and faster to send and receive.

Share this