PG Phriday: Dearly De-parted

June 09, 2017

With the christening of PG Phriday at its new location, it’s only fitting that we revisit a frequent boon and bane of organized storage: partitioning. When volume and scale reach the point where the very laws of physics become a barrier to efficiency, partitions are a crucial tool to escape the otherwise inevitable pile of unmaintainable bits. It’s not feasible to reindex a multi-terabyte table with tens or hundreds of billions of rows. It’s excruciating to excise portions for long-term archival, and all but impossible to perform anything but the most cursory maintenance in general. Huge tables are a partition advocate’s dream.

But until fairly recently, the implementation in Postgres has been… less than savory. While nothing is perfect, Postgres has never actually offered partitioning. That may sound like a bold statement, but consider how the functionality is cobbled together. Let’s build four partitions for 10-million rows worth of data using the “old” approach:

CREATE SCHEMA old_part;
SET search_path TO old_part;

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2016 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2016-01-01' AND
           reading_date < '2017-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2017 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2017-01-01' AND
           reading_date < '2018-01-01')    
) INHERITS (sensor_log);

CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2017 THEN
      INSERT INTO sensor_log_part_2017 VALUES (NEW.*);
    WHEN 2016 THEN
      INSERT INTO sensor_log_part_2016 VALUES (NEW.*);
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
  END CASE;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

\timing on

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);

Time: 407622.474 ms (06:47.622)

To create something akin to partitioning, we need to combine table inheritance with table constraints to help the planner prune execution trees. Then we need to inject a trigger to redirect data to its final destination. And if the trigger can’t be generalized, it needs to be regularly updated as new partitions arrive. There’s a reason pg_partman and pg_pathman exist, and why it’s so easy to get wrong.
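
If the trigger must be generalized, dynamic SQL is one way to do it. Here’s a minimal sketch of that idea (not the function used for the timings in this article), assuming the sensor_log_part_YYYY naming convention from above:

CREATE OR REPLACE FUNCTION sensor_partition_generic()
RETURNS TRIGGER AS $$
BEGIN
  -- Derive the target partition name from the row itself and insert
  -- the entire row there. This fails outright if the partition is missing.
  EXECUTE format(
    'INSERT INTO %I.%I SELECT ($1).*',
    TG_TABLE_SCHEMA,
    TG_TABLE_NAME || '_part_' || to_char(NEW.reading_date, 'YYYY')
  ) USING NEW;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

Even then, something still has to create new partitions ahead of time, which is exactly the sort of bookkeeping those extensions automate.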

And getting it wrong is seriously painful. Did someone make the mistake of creating the check constraints with an interval calculation instead of literal boundaries? Constraint exclusion only works against constant, immutable expressions, so the planner won’t consider those constraints at all, and every query on the base table will touch every partition. This is so prevalent that I’ve literally never encountered a partition setup that got it right the first time.
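
As a purely hypothetical illustration of that trap, here’s what a partition built around CURRENT_DATE instead of literal boundaries might look like; don’t actually add this to the running example:

-- Hypothetical misstep: CURRENT_DATE is not immutable, so constraint
-- exclusion can never use this check to skip the partition.
CREATE TABLE sensor_log_part_bad (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= CURRENT_DATE - INTERVAL '1 year')
) INHERITS (sensor_log);

Only the literal, constant boundaries used above give the planner something it can prune against.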

Our particular implementation does it The Right Way(tm), but I’ve got over a decade of experience guiding my steps. So far as performance is concerned, inserting 10 million rows doesn’t take overly long, considering this whole process was tested in a 2GB VirtualBox VM.

Postgres 10 aims to implement an actual native partition syntax and underlying support. Now the query planner can’t be confused, because it knows what the partition columns are, and what the partition rules dictate within very strict boundaries. We’re not misusing various database features in order to approximate some semblance of partitioning functionality anymore.

The new methodology looks something like this:

CREATE SCHEMA new_part;
SET search_path TO new_part;

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
)
PARTITION BY RANGE (reading_date);

CREATE TABLE sensor_log_part_2014
PARTITION OF sensor_log
FOR VALUES FROM ('2014-01-01') TO ('2015-01-01');

CREATE UNIQUE INDEX udx_sensor_log_2014_sensor_log_id ON sensor_log_part_2014 (sensor_log_id);
CREATE INDEX idx_sensor_log_2014_location ON sensor_log_part_2014 (location);
CREATE INDEX idx_sensor_log_2014_date ON sensor_log_part_2014 (reading_date);

CREATE TABLE sensor_log_part_2015
PARTITION OF sensor_log
FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');

CREATE UNIQUE INDEX udx_sensor_log_2015_sensor_log_id ON sensor_log_part_2015 (sensor_log_id);
CREATE INDEX idx_sensor_log_2015_location ON sensor_log_part_2015 (location);
CREATE INDEX idx_sensor_log_2015_date ON sensor_log_part_2015 (reading_date);

CREATE TABLE sensor_log_part_2016
PARTITION OF sensor_log
FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

CREATE UNIQUE INDEX udx_sensor_log_2016_sensor_log_id ON sensor_log_part_2016 (sensor_log_id);
CREATE INDEX idx_sensor_log_2016_location ON sensor_log_part_2016 (location);
CREATE INDEX idx_sensor_log_2016_date ON sensor_log_part_2016 (reading_date);

CREATE TABLE sensor_log_part_2017
PARTITION OF sensor_log
FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

CREATE UNIQUE INDEX udx_sensor_log_2017_sensor_log_id ON sensor_log_part_2017 (sensor_log_id);
CREATE INDEX idx_sensor_log_2017_location ON sensor_log_part_2017 (location);
CREATE INDEX idx_sensor_log_2017_date ON sensor_log_part_2017 (reading_date);

\timing on

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);

Time: 133352.829 ms (02:13.353)

There’s a lot going on there, but we can see some rather obvious changes immediately. First, the new table is classified as a partition base as part of its definition. We now have a table that must be partitioned by reading_date. Tables defined like this cannot store data themselves, and exist only as a virtual target for queries.

Why would that be? Because a partitioned table without any partitions has no categories or ranges dictating content locations. Postgres needs to know where to store new rows, and where existing rows reside. So our first task became one of adding a partition to fulfill that role.

We did that four times, just as we had for the original method using constraints. Since each new partition subtly describes the storage hierarchy, we don’t need a trigger anymore. Postgres knows the base table isn’t real, and that it needs to route rows to their intended location based on the partition definitions.
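
We can watch that routing, and how strict it is, with a quick sanity check. This is just a sketch of the expected behavior; the second statement should fail because no partition covers 2013:

SET search_path TO new_part;

-- Routed automatically into sensor_log_part_2016.
INSERT INTO sensor_log (location, reading, reading_date)
VALUES ('1', 42, '2016-06-01');

-- No partition covers 2013, so this aborts with an error along the
-- lines of "no partition of relation sensor_log found for row".
INSERT INTO sensor_log (location, reading, reading_date)
VALUES ('1', 42, '2013-06-01');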

The fact Postgres uses internal routing is reflected in the execution times as well: the new approach is over three times faster than the trigger-based one. This means partitions are now a much better fit for performance-sensitive OLTP environments than they were before. So not only are partitions much easier to use in Postgres 10, they also come at effectively native performance. Check this out:

CREATE SCHEMA no_part;
SET search_path TO no_part;

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

\timing on

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);

Time: 134562.724 ms (02:14.563)

While we’re here, we should also examine how the planner sees all of this under the hood. Here’s a basic search that straddles a small portion of two partitions, executed against each approach:

EXPLAIN ANALYZE
SELECT count(*)
  FROM old_part.sensor_log
 WHERE reading_date >= '2016-12-30'
   AND reading_date < '2017-01-03';

 Aggregate  (cost=1749.97..1749.98 rows=1 width=8) (actual time=11.428..11.428 rows=1 loops=1)
   ->  Append  (cost=0.00..1658.40 rows=36628 width=0) (actual time=0.018..9.469 rows=34560 loops=1)
         ->  Seq Scan on sensor_log  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               Filter: ((reading_date >= '2016-12-30 00:00:00'::timestamp without time zone) AND (reading_date < '2017-01-03 00:00:00'::timestamp without time zone))
         ->  Index Only Scan using sensor_log_part_2016_reading_date_idx on sensor_log_part_2016  (cost=0.43..824.95 rows=18226 width=0) (actual time=0.016..3.475 rows=17280 loops=1)
               Index Cond: ((reading_date >= '2016-12-30 00:00:00'::timestamp without time zone) AND (reading_date < '2017-01-03 00:00:00'::timestamp without time zone))
               Heap Fetches: 17280
         ->  Index Only Scan using sensor_log_part_2017_reading_date_idx on sensor_log_part_2017  (cost=0.43..833.45 rows=18401 width=0) (actual time=0.020..3.943 rows=17280 loops=1)
               Index Cond: ((reading_date >= '2016-12-30 00:00:00'::timestamp without time zone) AND (reading_date < '2017-01-03 00:00:00'::timestamp without time zone))
               Heap Fetches: 17280
 Planning time: 0.294 ms
 Execution time: 11.456 ms

EXPLAIN ANALYZE
SELECT count(*)
  FROM new_part.sensor_log
 WHERE reading_date >= '2016-12-30'
   AND reading_date < '2017-01-03';

 Aggregate  (cost=1761.02..1761.03 rows=1 width=8) (actual time=11.559..11.559 rows=1 loops=1)
   ->  Append  (cost=0.43..1668.78 rows=36896 width=0) (actual time=0.015..9.457 rows=34560 loops=1)
         ->  Index Only Scan using idx_sensor_log_2016_date on sensor_log_part_2016  (cost=0.43..824.95 rows=18226 width=0) (actual time=0.015..3.448 rows=17280 loops=1)
               Index Cond: ((reading_date >= '2016-12-30 00:00:00'::timestamp without time zone) AND (reading_date < '2017-01-03 00:00:00'::timestamp without time zone))
               Heap Fetches: 17280
         ->  Index Only Scan using idx_sensor_log_2017_date on sensor_log_part_2017  (cost=0.43..843.83 rows=18670 width=0) (actual time=0.021..3.486 rows=17280 loops=1)
               Index Cond: ((reading_date >= '2016-12-30 00:00:00'::timestamp without time zone) AND (reading_date < '2017-01-03 00:00:00'::timestamp without time zone))
               Heap Fetches: 17280
 Planning time: 0.248 ms
 Execution time: 11.585 ms

There’s something familiar about those two query plans… Oh, that’s right! They’re exactly the same. Well, almost. The only appreciable difference is that Postgres needs to scan the empty parent table in the first case, because it’s an actual table that can contain real data. The only reason that table is empty in our case is a mere implementation detail. In fact, some partitioning styles such as my own tab_tier extension take advantage of the base table as a type of “hot zone” for extremely active data to avoid trigger overhead.
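
And since the old-style parent is a genuine table, we can address it in isolation whenever we like:

-- Scan only the inheritance parent, ignoring every child partition.
-- In our example it returns zero rows, because the trigger redirected
-- everything into the children.
SELECT count(*) FROM ONLY old_part.sensor_log;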

So what if we want to retire a bunch of old data? The old way was to do something like this:

ALTER TABLE old_part.sensor_log_part_2014
   NO INHERIT old_part.sensor_log;

Then we could do whatever we want with the disinherited data. Normally that meant copying it to some tertiary location and dropping the table, but it doesn’t have to. We could alternately have it inherit from some other similar target—a long-term aggregate report base table, for instance.
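
For instance, assuming a hypothetical archive hierarchy with the same column layout, the freshly orphaned table could simply join it:

-- Hypothetical long-term archive parent sharing the same columns.
CREATE TABLE old_part.sensor_log_archive (
    LIKE old_part.sensor_log
);

ALTER TABLE old_part.sensor_log_part_2014
      INHERIT old_part.sensor_log_archive;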

The equivalent in Postgres 10 is more indicative of what’s going on:

ALTER TABLE new_part.sensor_log
      DETACH PARTITION new_part.sensor_log_part_2014;

Of course, all of this doesn’t come without a caveat or two. First of all, the base table no longer acts as a contextual template for its partitions. We can’t create indexes on the base table, and new partitions inherit nothing aside from column definitions and certain types of constraints. For at least this iteration of the feature, we also can’t use the LIKE syntax to cheat and copy existing indexes from some other table. That means it’s a little more tedious to declare new partitions than it otherwise could be.
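
In practice, each new partition repeats the full declaration-plus-indexes dance. A hypothetical 2018 partition for our example would look like this:

CREATE TABLE new_part.sensor_log_part_2018
PARTITION OF new_part.sensor_log
FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');

-- Every index must be declared again for every new partition.
CREATE UNIQUE INDEX udx_sensor_log_2018_sensor_log_id ON new_part.sensor_log_part_2018 (sensor_log_id);
CREATE INDEX idx_sensor_log_2018_location ON new_part.sensor_log_part_2018 (location);
CREATE INDEX idx_sensor_log_2018_date ON new_part.sensor_log_part_2018 (reading_date);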

Postgres has a funny way of iterating features though, so I suspect at least a few of those shortcomings will be softened in upcoming releases. I’ve always recommended that partitions be used only in extreme situations where maintenance concerns are paramount. There were too many ways they broke down and included far more data than expected, and even extension-augmented partition automation was a hair-raising prospect.

Postgres 10 means I’ll need to reevaluate that stance, because its native partitioning is nothing like the cobbled-together duct-tape contraption of old. Well… at least not from the end user’s perspective.
