NOTE: This post uses a subset of ~10,000 JSON song files from the Million Song Database (MSD) project. Beware, this is a fun and addictive dataset to work with JSON files!
JSON has fast become the de facto language of data interchange. It is a simple way for data to be represented as it is passed around between various services. The PostgreSQL community introduced support for the JSON data type a few years ago (with v9.2) and recently improved it (with v9.4) by adding a binary storage format (JSONB).
As this core Postgres technology has evolved, we have seen more and more users storing JSON data in Postgres for uses as varied as financial risk management applications, sensor data collection, change history tracking information, or simply just passing large data sets around.
The use of JSON and JSONB is gathering momentum as more and more developers grasp the potential. When I was presenting on integrating NoSQL data types into PostgreSQL earlier this year, a member of the audience suggested we pose the following question to audience: "Who here does not have at least one JSONB field in their database?" (Just over half of the people said they have no JSONB). He went on to assert that that in a year or two, nearly everyone would say "Yes, I have JSONB in my relational model."
I agree wholeheartedly. But that brings up a serious problem for our existing application code and design architectures. Namely, "How can we present traditional PostgreSQL relations (tables and views) to Hibernate, JPA-applications and existing APIs while using JSONB?"
How to: JSONB in Action
A solution we did for one customer, shown in this post, was to create views and/or materialized views using PostgreSQL built-in JSONB operators. The rest of this post goes through the SQL to load the MSD subset and present the results as a view for Hibernate. (NOTE: Additional details and examples are in the presentation from PG Con 2015.)
First, create a symbolic link from the MSD JSON directory to Then create a view of the JSON files in the ExtFiles directory and, finally, generate the SQL COPY commands to load each file into the table.
-- Create a view of all the *.json files under $PGDATA/ExtFiles
CREATE OR REPLACE VIEW json_files AS (
WITH RECURSIVE recursive_files(filename, isdir) AS (
SELECT filename AS filename,
(pg_stat_file(filename)).isdir
FROM pg_ls_dir('.') f(filename)
UNION
SELECT filename || '/' || filename2,
(pg_stat_file(filename || '/' || filename2)).isdir
FROM recursive_files, pg_ls_dir(recursive_files.filename) f(filename2)
WHERE recursive_files.isdir is True
)
SELECT filename FROM recursive_files
WHERE isdir is false
AND filename ILIKE 'ExtFiles%.json'
);
-- Create the j_songs destination table
CREATE TABLE j_songs (
id SERIAL PRIMARY KEY,
song JSONB
);
-- Generate the COPY commands to load each file
SELECT 'COPY nosql.j_songs(song) FROM ''' || filename ||
''' CSV QUOTE e''\x01'' DELIMITER e''\x02'';'
FROM json_files;
The song JSON files are structured with a track_id, title and artist followed by two arrays:
tags + weight (think 'rock':80, 'alternative':90, 'blues':50)
similars + weight (similar track_ids with weights from 0-100)
Once the JSON files are COPYd into j_songs, create an index to speed the view query.
-- DROP INDEX idx_j_song_song;
CREATE INDEX idx_j_song_song ON j_songs USING gin (song);
Next create the view, v_songs.
-- DROP VIEW v_songs;
CREATE OR REPLACE VIEW v_songs AS
SELECT
song ->> 'track_id' AS track_id,
song ->> 'artist' AS artist,
song ->> 'title' AS title
FROM j_songs;
-- Now our JSONB data appears as a traditional table
SELECT *
FROM v_songs
LIMIT 10;
A slightly faster alternative is to create a materialized view with (traditional) indexes of the JSONB field as shown below.
-- DROP MATERIALIZED VIEW mv_songs;
CREATE MATERIALIZED VIEW mv_songs AS
SELECT
song ->> 'track_id' AS track_id, -- the ->> operator returns TEXT
song ->> 'artist' AS artist,
song ->> 'title' AS title
FROM j_songs;
-- Refresh the Mview when the JSONB records change
REFRESH MATERIALIZED VIEW mv_songs;
-- Add indexes ... just like a traditional table
-- DROP INDEX idx_mv_songs_track_id;
CREATE INDEX idx_mv_songs_track_id ON mv_songs(track_id);
-- EXPLAIN ANALYZE -- 0.034 ms. Wicked fast!
SELECT *
FROM mv_songs
WHERE track_id = 'TRAXLZU12903D05F94';
This combination of JSONB data type presented with a traditional view (or materialized view) leverage the power of JSON – and potentially other NoSQL data types – while preserving the relational data presentation required by Hibernate, other JPAs and a lot of legacy code.
To learn more about Postgres and how to optimize key features like JSON/JSONB support, please contact us.
Jamey Hanson is a PostgreSQL data architect at Freedom Consulting Group, a partner of EnterpriseDB in the government sector.