Yet Another JSON parser for PostgreSQL

August 14, 2024

Here's the story of how we ended up with two JSON parsers in our code.

First some history.

Back in 2012 when we first introduced the JSON data type, Robert Haas wrote a simple hand-cut stack based parser. It wasn't designed to do very much, other than to validate that the JSON complied with the JSON specification. One year later, I needed some facilities to do all sorts of processing of JSON, e.g. to extract data from it. To manage that I needed a different sort of parser, one following a well known pattern called Recursive Descent. This is just about the simplest parser recipe there is. You don't need a parser generator like bison,  and in the JSON case the code was (and is) pretty small. The advantage of the new parser was that it was easy to plug into it semantic actions to take at various critical points during the parsing process. And that's how I was able to add the functionality I mentioned. And that's where things stayed for nearly ten years.

Fast forward to 2022. Robert was working on something else that's taken a lot of his time lately: incremental backup. PostgreSQL backup manifests are JSON documents, and they can get pretty huge. He wanted a way to process manifests incrementally rather than all at once. The existing JSON parser couldn't process JSON in chunks, you have to give it the whole JSON document or it errors out. Robert dropped me a message asking if we could modify that.

On looking into the problem I very quickly came to the conclusion that there was no way to modify the existing parser to handle incremental parsing. The problem is that the parser state in a Recursive Descent parser is effectively encapsulated on the call stack. If we interrupt the parser because we have come to the end of the chunk of data, there is no real way to put back that state when we pass in the next chunk so it can keep going as if there hadn't been a break in processing. Even if we could do it things would be very messy and error prone. I became convinced we needed a new parser.

Fortunately, there is a suitable if somewhat less known parser pattern. It's documented in the venerable Dragon Book at algorithm 4.3 under the heading "Nonrecursive Predictive Parsing". This method handles the same class of grammars that you can use with Recursive Descent parsers, but instead of implicitly storing state on the call stack it uses an explicit stack. That means that if it comes to the end of a chunk of input and returns, as long as we have saved its stack we can resume parsing the next chunk where we left off. Thus that takes care of the incremental parse issue.

So I implemented this type of parser. There was some supplemental work that needed to be done with the lexer (the thing that recognizes individual tokens in a JSON document), especially if it came to the end of the input possibly in the middle of a token. But this was manageable.

My first plan was simply to replace the Recursive Descent parser with this type of parser. But then I came to a problem. When I tested it the Nonrecursive parser was simply slower than the Recursive Descent parser. That's not entirely surprising. CPUs are very adept at setting up and unwinding call stack frames with very few instructions. I, along with a couple of others, looked at ways of eliminating this difference, but in the end it's still about 60% slower using the Nonrecursive parser.

In the end we simply made a decision to carry both parsers. The Recursive Descent parser is used everywhere except where we are parsing backup manifests, for which we use the new parser. This isn't a use that is super performance critical, so it's  worth the speed cost to be able to parse huge manifests. And other use cases for incrementally parsing JSON are in the works.

This work will be in PostgreSQL 17, thanks in part to Jacob Champion who was a considerable help in debugging and improving the testing regime.

Share this