Contributed by Jamey Hanson
Adoption of document databases is growing rapidly in response to the need for solutions that can handle large volumes of data. These solutions are adept at handling non-relational data models, but the distinction between database types (document, relational, key-value, etc.) is blurring. Organizations are making broad sweeps of data from multiple sources and storing it in single large documents for later analysis.
PostgreSQL has kept up with evolving storage patterns in the data center with the addition of the JSON data type. PostgreSQL now supports many of the same workloads as MongoDB. Because of this capability, PostgreSQL users are seeking additional knowledge and expertise in working with JSON documents.
Postgres enterprise integration across traditional DBMSs, Hadoop, and NoSQL. Download Now.
This post explores how to create nested JSON documents from relational tables with PostgreSQL 9.5 using a data set of 47,000 airports from http://ourairports.com. (A nod of thanks to the folks at OurAirports.com.) This is the first of two PostgreSQL JSON posts using the airports data set. The second post will explore how to index, query, update and present the JSON documents in PostgreSQL 9.5.
Data Set
OurAirport.com’s data set includes data from approximately 47,000 airports worldwide. There are six, *.csv files corresponding to the tables in the ER diagram below. The files contain UTF8, non-ASCII characters and embedded special characters including single/double quotes and curly-braces.
Airports are uniquely identified by the airport.ident field, which contains the three-letter airport identifier preceded by the country code, where USA is 'K'. For example, Baltimore-Washington International Airport is 'KBWI'. Airports have zero-to-many runways, navaids (navigational aids) and airport_frequencies. I used custom enums for airport.type and various navaids fields because they are useful and simple in PostgreSQL. airports.keywords is an array of values indicating the associated major city. For example Baltimore-Washington International, Reagan National and Dulles all have the keyword 'WAS'. countries and region.keywords are also arrays which are not reflected in the relational model.
I loaded the files using the file foreign data wrapper, file_fdw, because it allows PostGIS and other SQL functions to transform the data during the load. I highly recommend this technique. SQL for loading the data files and creating the custom data types is at the end of the post. The database cluster is PostgreSQL 9.5 with the file_fdw and PostGIS 2.2 extensions running on a laptop CentOS 7.x VM.
Figure 1. Six tables used to create the airport JSON documents
Generate airport JSON documents
The (pretty complicated) SQL below builds airport JSON documents and loads them into a new table airports_json. I created airports_json from a SELECT statement rather than export / import for compactness, but I recommend exporting the data to a file and then importing it into a work environment.
Create the JSON document model
The first step in creating the airports JSON document is to create a document model. The three zero-to-many relationships of airports=<runways, airports=<navaids and airports=<airport_frequencies map naturally to arrays of JSON. The location data for airports and runway leading / trailing edges should be modeled in GeoJSON, the standard for GIS data in JSON. Some of the column names are ambiguous and need to be changed when converting to JSON tags. For example, airport.home_link is changed to airport_home_link for clarity. The three keyword fields could be modeled as strings, like they are in the source data, or arrays in JSON. I chose to model them as arrays because it seems more natural.
The next modeling decision is whether to repeat countries and regions data in each document. The advantage of not repeating the values is better storage efficiency and simpler updates. This is why they are separate tables in the relational model. The advantages of including countries and regions data in each document are completeness and simplicity, which is why I chose to repeat them in each document in this example.
The final modeling decision is whether to remove NULLs (or nulls in JSON-speak) from the final document. The JSON purist approach is always to remove nulls, but relational and hybrid models often keep (some of) them. I removed all nulls in this example.
Make Sure all Needed Data is Available or Can be Generated
The next step in generating JSON is to make sure all the needed data needed for the JSON document is available or easily generated in the relational model. In this example, all six tables plus joins are needed. airports, regions and countries are all related with inner joins. airports to runways, navaids and airport_frequencies are left outer joins.
The location data is provided as latitude_deg / longitude_deg, which is very difficult to turn into GeoJSON. Luckily, PostGIS has a function to create GeoJSON, ST_AsGeoJSON(geometry). This requires that lat/long be converted to geometry, which is easily done as part of the SELECT statement from the external to the internal table, ST_GeomFromText('POINT(' || longitude_deg || ' ' || latitude_deg || ')',4326).
Build the JSON Document from the Inside-out (using JSONB)
NOTE: Use JSONB functions and casting when building JSON documents because PostgreSQL automatically handles control characters, single/double quotes and curly braces, in sub-documents and GeoJSON. PostgreSQL treats JSON (not JSONB) sub-documents as text strings and escape all control characters. This was not self-evident when I started this project.
The final step is to build the JSON document step-by-step from the inside-out starting with the runways, navaids and airport_frequencies embedded JSONB.
The steps for creating the three embedded documents are the same:
1. Create the GeoJSON document and cast it to JSONB
ST_AsGeoJSON(location)::JSONB
2. Create a JSONB document for each row in the respective table with TO_JSONB(anyelement)
NOTE: The syntax for this is awkward, but you can cut-and-paste for this template.
3. Create a JSONB array of the JSONB rows using JSONB_AGG(expression)
4. Add SQL to map the sub-documents to the parent airports.
After the sub-documents are created and related to the parent airports, the next step is to build the outer-level tags with JSONB_BUILD_OBJECT(VARIADIC "any"). This is the step when the three keyword columns are renamed to descriptive JSON tags.
The final step is to remove nulls with JSONB_STRIP_NULLS.
The SQL to generate the JSONB is below:
CREATE TABLE airports_json AS (
SELECT JSONB_STRIP_NULLS(
JSONB_BUILD_OBJECT(
'ident', ident,
'type', type,
'name', air.name,
'location', ST_AsGeoJSON(location)::JSONB,
'elevation_ft', elevation_ft,
'continent', air.continent,
'iso_country', air.iso_country,
'country', cntry.name,
'country_keywords', TO_JSONB(STRING_TO_ARRAY(cntry.keywords, ', ')),
'iso_region', air.iso_region,
'region', reg.name,
'region_keywords', TO_JSONB(STRING_TO_ARRAY(reg.keywords, ', ')),
'municipality', municipality,
'scheduled_service', scheduled_service,
'gps_code', gps_code,
'airport_keywords', TO_JSONB(STRING_TO_ARRAY(air.keywords, ', ')),
'airport_home_link', home_link,
'airport_wikipedia', air.wikipedia_link,
'country_wikipedia', cntry.wikipedia_link,
'region_wikipedia', reg.wikipedia_link,
'runways', (
SELECT JSONB_AGG(TO_JSONB(r))
FROM (
SELECT
le_ident,
he_ident,
length_ft,
width_ft,
surface,
lighted AS is_lighted,
closed AS is_closed,
ST_AsGeoJSON(le_location)::JSONB le_location,
le_elevation_ft,
le_heading_degT,
le_displaced_threshold_ft,
ST_AsGeoJSON(he_location)::JSONB he_location,
he_elevation_ft,
he_heading_degT,
le_displaced_threshold_ft
FROM runways
WHERE airport_ident = air.ident
) r ),
'navaids', (
SELECT JSONB_AGG(TO_JSONB(n))
FROM (
SELECT
name,
filename,
ident,
type,
frequency_khz,
ST_AsGeoJSON(location)::JSONB AS location,
elevation_ft,
dme_frequency,
dme_channel,
ST_AsGeoJSON(dme_location)::JSONB AS dme_location,
dme_elevation,
slaved_variation_deg,
magnetic_variation_deg,
usagetype,
power
FROM navaids
WHERE associated_airport = air.ident
) n ),
'airport_frequencies', (
SELECT JSONB_AGG(TO_JSONB(f))
FROM (
SELECT
type,
description,
frequency_mhz
FROM airport_frequencies
WHERE airport_ident = air.ident
) f )
)) AS airports
FROM airports air
INNER JOIN regions reg ON air.iso_region = reg.code
INNER JOIN countries cntry ON air.iso_country = cntry.code
);
Look at the Output
Take a look at an airports_json document to validate the model and make sure the fields are labeled and converted properly. The SQL and JSON document below are for College Park Airport, the oldest continuously operating airport in the world.
SELECT College Park Airport
SELECT JSONB_PRETTY(airports)
FROM airports_json
WHERE airports ->> 'ident' = 'KCGS';
College Park Airport JSON document
{
"name": "College Park Airport",
"type": "small_airport",
"ident": "KCGS",
"region": "Maryland",
"country": "United States",
"runways": [
{
"surface": "ASP",
"he_ident": "33",
"le_ident": "15",
"width_ft": 60,
"is_closed": false,
"length_ft": 2607,
"is_lighted": true,
"he_location": {
"type": "Point",
"coordinates": [
-76.9192,
38.9779
]
},
"le_location": {
"type": "Point",
"coordinates": [
-76.9253,
38.9833
]
},
"he_elevation_ft": 40,
"he_heading_degt": 318.3,
"le_elevation_ft": 47,
"le_heading_degt": 138.3,
"le_displaced_threshold_ft": 416
}
],
"gps_code": "KCGS",
"location": {
"type": "Point",
"coordinates": [
-76.9223022461,
38.9805984497
]
},
"continent": "NA",
"iso_region": "US-MD",
"iso_country": "US",
"elevation_ft": 48,
"municipality": "College Park",
"airport_keywords": [
"DC3",
"FRZ",
"ADIZ"
],
"country_keywords": [
"America"
],
"region_wikipedia": "http://en.wikipedia.org/wiki/Maryland",
"airport_home_link": "http://www.collegeparkairport.aero/",
"airport_wikipedia": "http://en.wikipedia.org/wiki/College_Park_Airport",
"country_wikipedia": "http://en.wikipedia.org/wiki/United_States",
"scheduled_service": false,
"airport_frequencies": [
{
"type": "A/D",
"description": "POTOMAC APP/DEP",
"frequency_mhz": 119.85
},
{
"type": "AWOS",
"description": "AWOS 3",
"frequency_mhz": 121.225
},
{
"type": "UNIC",
"description": "CTAF/UNICOM",
"frequency_mhz": 122.975
}
]
}
Load RDBMS data
The SQL below was used to create and populate the relational tables.
CREATE SERVER ext_table FOREIGN DATA WRAPPER file_fdw;
-- Countries
CREATE TABLE countries (
id BIGINT PRIMARY KEY,
code CHAR(2) NOT NULL UNIQUE,
name TEXT,
continent CHAR(2),
wikipedia_link TEXT,
keywords TEXT);
COMMENT ON TABLE countries IS 'Data from http://ourairports.com/data/';
CREATE FOREIGN TABLE fdw_countries (
id BIGINT,
code CHAR(2),
name TEXT,
continent CHAR(2),
wikipedia_link TEXT,
keywords TEXT
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/countries.csv',
format 'csv', header 'true');
INSERT INTO countries (
SELECT * FROM fdw_countries);
-- Regions
CREATE TABLE regions (
id BIGINT PRIMARY KEY,
code VARCHAR(7) NOT NULL UNIQUE,
local_code VARCHAR(4),
name TEXT,
continent CHAR(2),
iso_country CHAR(2) REFERENCES countries(code),
wikipedia_link TEXT,
keywords TEXT);
COMMENT ON TABLE regions IS 'Data from http://ourairports.com/data/';
CREATE FOREIGN TABLE fdw_regions (
id INT,
code VARCHAR(7),
local_code VARCHAR(4),
name TEXT,
continent CHAR(2),
iso_country CHAR(2),
wikipedia_link TEXT,
keywords TEXT
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/regions.csv',
format 'csv', header 'true');
INSERT INTO regions (
SELECT * FROM fdw_regions);
-- Airports
CREATE TYPE airport_type AS ENUM (
'balloonport',
'closed',
'heliport',
'large_airport',
'medium_airport',
'seaplane_base',
'small_airport');
CREATE TABLE airports (
id BIGINT PRIMARY KEY,
ident VARCHAR(7) NOT NULL UNIQUE,
type airport_type,
name TEXT,
latitude_deg NUMERIC NOT NULL,
longitude_deg NUMERIC NOT NULL,
location GEOMETRY,
elevation_ft INT,
continent CHAR(2),
iso_country CHAR(2) REFERENCES countries(code),
iso_region VARCHAR(7) REFERENCES regions(code),
municipality TEXT,
scheduled_service BOOLEAN,
gps_code VARCHAR(4),
iata_code VARCHAR(3),
local_code VARCHAR(7),
home_link TEXT,
wikipedia_link TEXT,
keywords TEXT);
COMMENT ON TABLE airports IS 'Data from http://ourairports.com/data/';
CREATE FOREIGN TABLE fdw_airports (
id BIGINT,
ident VARCHAR(7),
TYPE airport_type,
name TEXT,
latitude_deg NUMERIC,
longitude_deg NUMERIC,
elevation_ft INT,
continent CHAR(2),
iso_country CHAR(2),
iso_region VARCHAR(7),
municipality TEXT,
scheduled_service BOOLEAN,
gps_code VARCHAR(4),
iata_code VARCHAR(3),
local_code VARCHAR(7),
home_link TEXT,
wikipedia_link TEXT,
keywords TEXT
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/airports.csv',
format 'csv', header 'true');
INSERT INTO airports (
SELECT
id,
ident,
TYPE,
name,
latitude_deg,
longitude_deg,
ST_GeomFromText('POINT(' || longitude_deg || ' ' || latitude_deg || ')',4326),
elevation_ft,
continent,
iso_country,
iso_region,
municipality,
scheduled_service,
gps_code,
iata_code,
local_code,
home_link,
wikipedia_link,
keywords
FROM fdw_airports);
-- Runways
CREATE TABLE runways (
id BIGINT PRIMARY KEY,
airport_ref BIGINT,
airport_ident VARCHAR(7) REFERENCES airports(ident),
length_ft INT,
width_ft INT,
surface TEXT,
lighted BOOLEAN,
closed BOOLEAN,
le_ident VARCHAR(6),
le_latitude_deg NUMERIC,
le_longitude_deg NUMERIC,
le_location GEOMETRY,
le_elevation_ft INT,
le_heading_degt NUMERIC,
le_displaced_threshold_ft INT,
he_ident VARCHAR(6),
he_latitude_deg NUMERIC,
he_longitude_deg NUMERIC,
he_location GEOMETRY,
he_elevation_ft INT,
he_heading_degt NUMERIC,
he_displaced_threshold_ft INT);
COMMENT ON TABLE runways IS 'Data from http://ourairports.com/data/';
CREATE FOREIGN TABLE fdw_runways (
id BIGINT,
airport_ref BIGINT,
airport_ident VARCHAR(7),
length_ft INT,
width_ft INT,
surface TEXT,
lighted BOOLEAN,
closed BOOLEAN,
le_ident VARCHAR(6),
le_latitude_deg NUMERIC,
le_longitude_deg NUMERIC,
le_elevation_ft INT,
le_heading_degt NUMERIC,
le_displaced_threshold_ft INT,
he_ident VARCHAR(6),
he_latitude_deg NUMERIC,
he_longitude_deg NUMERIC,
he_elevation_ft INT,
he_heading_degt NUMERIC,
he_displaced_threshold_ft INT
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/runways.csv',
format 'csv', header 'true');
INSERT INTO runways (
SELECT
id,
airport_ref,
airport_ident,
length_ft,
width_ft,
surface,
lighted,
closed,
le_ident,
le_latitude_deg,
le_longitude_deg,
ST_GeomFromText('POINT(' || le_longitude_deg || ' ' || le_latitude_deg || ')', 4326),
le_elevation_ft,
le_heading_degt,
le_displaced_threshold_ft,
he_ident,
he_latitude_deg,
he_longitude_deg,
ST_GeomFromText('POINT(' || he_longitude_deg || ' ' || he_latitude_deg || ')',4326),
he_elevation_ft,
he_heading_degt,
he_displaced_threshold_ft
FROM fdw_runways);
CREATE TYPE navaid_type AS ENUM (
'DME', 'NDB', 'NDB-DME', 'TACAN', 'VOR', 'VOR-DME', 'VORTAC');
CREATE TYPE navaid_usagetype AS ENUM (
'BOTH', 'HI', 'LO', 'RNAV', 'TERMINAL');
CREATE TYPE navaid_power AS ENUM (
'HIGH', 'LOW', 'MEDIUM', 'UNKNOWN');
-- navaids
CREATE TABLE navaids (
id BIGINT,
filename TEXT,
ident TEXT,
name TEXT,
TYPE navaid_type,
frequency_khz BIGINT,
latitude_deg NUMERIC,
longitude_deg NUMERIC,
location GEOMETRY,
elevation_ft INT,
iso_country TEXT,
dme_frequency NUMERIC,
dme_channel TEXT,
dme_latitude_deg NUMERIC,
dme_longitue_deg NUMERIC,
dme_location GEOMETRY,
dme_elevation INT,
slaved_variation_deg NUMERIC,
magnetic_variation_deg NUMERIC,
usagetype navaid_usagetype,
POWER navaid_power,
associated_airport VARCHAR(7) REFERENCES airports(ident)
);
CREATE FOREIGN TABLE fdw_navaids (
id INT,
filename TEXT,
ident TEXT,
name TEXT,
TYPE navaid_type,
frequency_khz BIGINT,
latitude_deg NUMERIC,
longitude_deg NUMERIC,
elevation_ft INT,
iso_country TEXT,
dme_frequency NUMERIC,
dme_channel TEXT,
dme_latitude_deg NUMERIC,
dme_longitude_deg NUMERIC,
dme_elevation INT,
slaved_variation_deg NUMERIC,
magnetic_variation_deg NUMERIC,
usagetype navaid_usagetype,
POWER navaid_power,
associated_airport VARCHAR(7)
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/navaids.csv',
format 'csv', header 'true');
INSERT INTO navaids (
SELECT
id,
filename,
ident,
name,
TYPE,
frequency_khz,
latitude_deg,
longitude_deg,
ST_GeomFromText('POINT(' || longitude_deg || ' ' || latitude_deg || ')',4326),
elevation_ft,
iso_country,
dme_frequency,
dme_channel,
dme_latitude_deg,
dme_longitude_deg,
ST_GeomFromText('POINT(' || dme_longitude_deg || ' ' || dme_latitude_deg || ')',4326),
dme_elevation,
slaved_variation_deg,
magnetic_variation_deg,
usagetype,
POWER,
associated_airport
FROM fdw_navaids);
-- airport_frequencies
CREATE TABLE airport_frequencies (
id BIGINT PRIMARY KEY,
airport_ref BIGINT,
airport_ident VARCHAR(7) REFERENCES airports(ident),
TYPE TEXT,
description TEXT,
frequency_mhz NUMERIC);
CREATE FOREIGN TABLE fdw_airport_frequencies (
id BIGINT,
airport_ref BIGINT,
airport_ident VARCHAR(7),
TYPE TEXT,
description TEXT,
frequency_mhz NUMERIC
) SERVER ext_table
OPTIONS (filename '/mnt/hgfs/VMShare/Data/Airports/airport-frequencies.csv',
format 'csv', header 'true');
INSERT INTO airport_frequencies (
SELECT * FROM fdw_airport_frequencies);
Congratulations! You are well on your way to mastering the use of JSON in PostgreSQL documents. Stay tuned on the EDB Postgres Blog for part II in this series to learn how to index, query, update and present the JSON documents in PostgreSQL 9.5.
Jamey Hanson is a Data Architect at Freedom Consulting Group, an EnterpriseDB Partner.