In our last article about cursors in PostgreSQL, we talked about Common Table Expressions (CTE). Today, we continue to discover new alternatives to cursors by using a lesser known feature of PostgreSQL.
We’ll use the data that we imported in the previous article (linked above). I’ll wait a moment for you to follow the procedure there.
Got that? Ok.
The data is a taxonomy chart of the natural world. As a reminder from basic high school biology, that data is organized by Carl Linnaeus into Kingdom, Phylum, Class, Order, Family, Genus and Species. Of course, science has moved forward ever so slightly in the last 250 years, so the taxonomic chart is 21 levels deep. We find the hierarchy tree in a table that is (unsurprisingly) called itis.hierarchy
.
The topic of this article is how to use ltrees in PostgreSQL. Specifically, how to use them for traversing a complex recordset very efficiently. In that sense, we can consider them another surrogate for cursors.
The data is not curated (unfortunately for us) in an ltree format, so we’re going to transform it a bit for the sake of the article.
First, you’ll have to install ltree in the database that you are using to follow this article. Of course, you need to be a super user to install extensions.
CREATE EXTENSION IF NOT EXISTS ltree;
Now we’ll use this extension to provide some very efficient lookups. We’re going to need to transform the data into a lookup table. To perform this transformation, we’re going to use the CTE technique that we covered in the last article. Along the way, we are going to add the Latin names and the English names to the taxonomy tree. This will help us look up items by number, Latin names, or English names.
<?prettify lang-sql linenums=1?>
-- We need a little helper function to strip out illegal label names.
CREATE OR REPLACE FUNCTION strip_label(thelabel text)
RETURNS TEXT
AS $$
-- make sure all the characters in the label are legal
SELECT SELECT
regexp_replace(
regexp_replace(
regexp_replace(
regexp_replace(
-- strip anything not alnum (yes, this could be way more accurate)
thelabel, '[^[:alnum:]]', '_','g'),
-- consolidate underscores
'_+', '_', 'g'),
-- strip leading/trailing underscores
'^_*', '', 'g'),
'_*$', '', 'g');
$$
LANGUAGE sql;
CREATE MATERIALIZED VIEW itis.world_view AS
WITH RECURSIVE world AS (
-- Start with the basic kingdoms
SELECT h1.tsn, h1.parent_tsn, h1.tsn::text numeric_taxonomy,
-- There is no guarantee that there will be a textual name
COALESCE(l1.completename,h1.tsn::text,'')::text latin_taxonomy,
-- and again no guarantee of a common english name
COALESCE(v1.vernacular_name, lower(l1.completename),h1.tsn::text,'unk')::text english_taxonomy
FROM itis.hierarchy h1
LEFT JOIN itis.longnames l1
ON h1.tsn = l1.tsn
LEFT JOIN itis.vernaculars v1
ON (h1.tsn, 'English') = (v1.tsn, v1.language)
WHERE h1.parent_tsn = 0
UNION ALL
SELECT h1.tsn, h1.parent_tsn, w1.numeric_taxonomy || '.' || h1.tsn,
w1.latin_taxonomy || '.' || COALESCE(strip_label(l1.completename), h1.tsn::text,'unk'),
w1.english_taxonomy || '.' || strip_label(COALESCE(v1.vernacular_name, lower(l1.completename), h1.tsn::text, 'unk'))
FROM itis.hierarchy h1
JOIN world w1
ON h1.parent_tsn = w1.tsn
LEFT JOIN itis.longnames l1
ON h1.tsn = l1.tsn
LEFT JOIN -- just change this to "itis.vernaculars v1" to allow mulitples and all languages. (Millions of records.)
(SELECT tsn, min(vernacular_name) vernacular_name FROM itis.vernaculars WHERE language = 'English' GROUP BY tsn) v1
ON (h1.tsn) = (v1.tsn)
)
SELECT w2.tsn, w2.parent_tsn, w2.numeric_taxonomy::ltree, w2.latin_taxonomy::ltree latin_taxonomy, w2.english_taxonomy::ltree english_taxonomy
FROM world w2
ORDER BY w2.numeric_taxonomy
WITH NO DATA;
Let’s stop for a moment and smell the flowers in this query. For starters, we created it without populating any data. This gives us a chance to take care of any syntactical issues before generating a lot of useless data. We are using the iterative nature of the common table expression to put together a pretty deep structure here, and we could easily extend it to cover more languages by adding data to the vernaculars table. The materialized view also has some interesting performance characterics. It will truncate and rebuild the table whenever a REFRESH MATERIALIZED VIEW
is called.
What we’re going to do next is refresh our world view. Mostly because it is healthy to do that from time to time. But in this case, what it actually does is populate the materialized view with data from the itis
schema.
REFRESH MATERIALIZED VIEW itis.world_view;
This is going to take a few minutes to create the 600K+ rows from the data.
The first few rows will look like this:
┌────────────┬─────────┬───────────────────────────────────────────────────────────────────────────────┐
│ parent_tsn │ tsn │ english_taxonomy │
├────────────┼─────────┼───────────────────────────────────────────────────────────────────────────────┤
│ 768374 │ 1009037 │ animals.bilateria.protostomia.ecdysozoa.arthropods.hexapods.insects.winged_in…│
│ │ │…sects.modern_wing_folding_insects.holometabola.ants.ants.aculeata.apoid_wasps…│
│ │ │….cicadakillers.crabroninae.larrini.gastrosericina.gastrosericus.gastrosericus…│
│ │ │…_xanthophilus │
│ 768374 │ 1009038 │ animals.bilateria.protostomia.ecdysozoa.arthropods.hexapods.insects.winged_in…│
│ │ │…sects.modern_wing_folding_insects.holometabola.ants.ants.aculeata.apoid_wasps…│
│ │ │….cicadakillers.crabroninae.larrini.gastrosericina.gastrosericus.gastrosericus…│
│ │ │…_zoyphion │
│ 768374 │ 1009039 │ animals.bilateria.protostomia.ecdysozoa.arthropods.hexapods.insects.winged_in…│
│ │ │…sects.modern_wing_folding_insects.holometabola.ants.ants.aculeata.apoid_wasps…│
│ │ │….cicadakillers.crabroninae.larrini.gastrosericina.gastrosericus.gastrosericus…│
│ │ │…_zyx │
│ 768216 │ 768387 │ animals.bilateria.protostomia.ecdysozoa.arthropods.hexapods.insects.winged_in…│
│ │ │…sects.modern_wing_folding_insects.holometabola.ants.ants.aculeata.apoid_wasps…│
│ │ │….cicadakillers.crabroninae.larrini.gastrosericina.holotachysphex │
│ 768387 │ 1009040 │ animals.bilateria.protostomia.ecdysozoa.arthropods.hexapods.insects.winged_in…│
│ │ │…sects.modern_wing_folding_insects.holometabola.ants.ants.aculeata.apoid_wasps…│
│ │ │….cicadakillers.crabroninae.larrini.gastrosericina.holotachysphex.holotachysph…│
│ │ │…ex_holognathus │
└────────────┴─────────┴───────────────────────────────────────────────────────────────────────────────┘
In a taxonomy, the graph would look something like this:
Of course, it would actually be 21 levels deep, and 600K+ records total.
Now we get to the fun part! ltrees provide a way to do some very complex queries on a hierarchy. The help for that is in the PostgreSQL documentation, so we won’t go into it very deeply here. For a (very quick) understanding, each segment of an ltree is called a label. So, this ltree kingdom.phylum.class.order.family.genus.species
has 7 labels.
Queries against an ltree use a special notation that is like regular expressions in a limited form.
Here’s a simple example: Animalia.*.Homo_sapiens
So a query to find humanity in the world would look like this:
SELECT tsn, parent_tsn, latin_taxonomy, english_taxonomy
FROM itis.world_view WHERE latin_taxonomy ~ 'Animalia.*.Homo_sapiens';
Which results in the expected:
┌────────┬────────────┬────────────────────────────────────────────────┬─────────────────────────────────────────────┐
│ tsn │ parent_tsn │ latin_taxonomy │ english_taxonomy │
├────────┼────────────┼────────────────────────────────────────────────┼─────────────────────────────────────────────┤
│ 180092 │ 180091 │ Animalia.Bilateria.Deuterostomia.Chordata.Vert…│ animals.bilateria.deuterostomia.chordates.v…│
│ │ │…ebrata.Gnathostomata.Tetrapoda.Mammalia.Theria…│…ertebrates.gnathostomata.tetrapoda.mammals.…│
│ │ │….Eutheria.Primates.Haplorrhini.Simiiformes.Hom…│…theria.eutheria.primates.haplorrhini.simiif…│
│ │ │…inoidea.Hominidae.Homininae.Homo.Homo_sapiens │…ormes.hominoidea.Great_Apes.African_apes.ho…│
│ │ │ │…minoids.Human │
└────────┴────────────┴────────────────────────────────────────────────┴─────────────────────────────────────────────┘
Of course, PostgreSQL would never leave it at this. There is an extensive set of operators, Indexes, Transformations, and Examples.
Go take a look at the vast array of capabilities that this technique unlocks.
Now imagine this technique applied to other complex data types such as part numbers, vehicle identification numbers, bill of materials structures or any other classification system. It is not necessary to expose this structure to the end user because of the prohibitively complex learning curve to use it directly. But it is entirely possible to build a “lookup” screen based on a structure like this that is very powerful and hides the complexity of the implementation.
For our next article in the series, we will be exploring the use of plug in languages. In the context of finding alternatives to cursors in PostgreSQL, we will use a language of our choice to model the data in the most appropriate way for our needs. See you next time!