PG Phriday: Crazy Correlated Column Crusade

July 14, 2017

For a long time, the Postgres query planner has sported a huge blinking neon blind-spot that frustrated and enraged DBAs throughout the universe to a level just shy of murderous frenzy. How is this even possible? What terrible lurking horror could elicit such a visceral response from probably the most boring and straight-laced people ever to draw breath? What else? Correlated statistics.

The Adventure Begins!

The Postgres query planner is a cost-estimation engine. Postgres gathers statistics on table contents such as most and least frequent values, value cardinality, rudimentary histograms, and so on. Combined with several metrics related to hardware responsiveness, these allow it to generate multiple viable execution strategies and choose the one that will probably consume the fewest resources. So far, this is something nearly all database engines do.

But there’s an area that query planners tend to skip: correlation. After all, if two columns are related in some manner, there’s nearly an infinite number of possible combinations; gathering statistics on all of them would be inadvisable to say the least. Unfortunately, doing so is critical in some cases.

To see why, let’s build a test case that helps illustrate the problem more directly. As usual, this means trotting out the old, reliable sensor_log table:


CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       BIGINT NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

ANALYZE sensor_log;

So now we have 1,000,000 rows spread across 1,000 sensors, tracking up to 100 distinct values. In this particular scenario, the location and reading columns are strongly correlated thanks to our use of the modulo operator: reading is simply location modulo 100, so every location maps to exactly one reading, and for locations below 100 the two values are identical.
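
If we want to verify the real number before asking the planner, a quick count works. Since 10 % 100 = 10, every row with location 10 also carries reading 10, so both predicates together should match the same 1,000 rows as the location predicate alone:


-- Should return 1000: one row for every thousandth id in the series.
SELECT count(*)
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;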

So far, this is what the Postgres query planner sees:


EXPLAIN
SELECT *
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;

                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log  (cost=19.67..2704.07 rows=10 width=28)
   Recheck Cond: (location = 10)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.66 rows=965 width=0)
         Index Cond: (location = 10)

What’s 1,000,000 divided by 1,000? We should be seeing 1,000 rows in the results, but the estimate is off by two orders of magnitude. Well, that’s pretty embarrassing.

Into the Bottomless Chasm

To see why Postgres mis-estimated the amount of matches, we need to view the statistics Postgres has stored for the sensor_log table:


SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'sensor_log'
   AND attname IN ('location', 'reading');

 attname  | n_distinct 
----------+------------
 location |       1000
 reading  |        100

Well, that looks correct, right? So what’s the problem? The issue is that Postgres doesn’t know these two columns are tightly bound. By default, predicate selectivities are multiplied together as if the columns were independent. So we have a 1/1,000 chance multiplied by a 1/100 chance, and we end up with a 1/100,000 chance. What’s 1,000,000 divided by 100,000? It’s no surprise that Postgres expects our query to match only 10 rows from sensor_log.
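
We can even reproduce that arithmetic by hand. The following is just a back-of-the-envelope sketch of the independence assumption using the catalog values shown above (it assumes the positive n_distinct values we just saw), not the planner’s actual code path:


-- Naive estimate: total rows * (1 / distinct locations) * (1 / distinct readings)
SELECT c.reltuples * (1.0 / l.n_distinct) * (1.0 / r.n_distinct) AS naive_estimate
  FROM pg_class c
  JOIN pg_stats l ON (l.tablename = c.relname AND l.attname = 'location')
  JOIN pg_stats r ON (r.tablename = c.relname AND r.attname = 'reading')
 WHERE c.relname = 'sensor_log';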

A Hero is Born

That used to be the end of the story. Postgres simply couldn’t handle correlated columns, and would repeatedly botch row estimates where they occurred. In some cases, a good DBA with a lot of optimization expertise could construct a one-time remedy. Unfortunately, it isn’t always possible to hand-tune every query, and when it wasn’t, everyone simply had to hope the bad estimates weren’t too detrimental, or were at least exceptionally rare.

Starting with Postgres 10 and the new CREATE STATISTICS statement, however, things are a bit different. Now we can explicitly inform Postgres of column correlations where relevant, and the results are quite amazing. Let’s declare the dependency and attempt the query again:


CREATE STATISTICS stats_sensor_log_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log;

ANALYZE sensor_log;

EXPLAIN
SELECT *
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;

                                       QUERY PLAN                                        
-----------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log  (cost=19.94..2714.09 rows=970 width=28)
   Recheck Cond: (location = 10)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
         Index Cond: (location = 10)

Much better! Now when Postgres encounters these two columns in a query, it will have a far more accurate picture of how the data is related.
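
For the curious, we can also peek at what ANALYZE actually stored for our new statistics object. On Postgres 10 the functional dependency coefficients live in the pg_statistic_ext catalog (later releases shuffle these catalogs around a bit), so something along these lines should display them:


-- Inspect the dependency coefficients computed by ANALYZE (Postgres 10 layout).
SELECT stxname, stxdependencies
  FROM pg_statistic_ext
 WHERE stxname = 'stats_sensor_log_location_reading_1t1';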

Following the White Rabbit

Why is this so important? In this case the query plans are the same, and would produce identical results regardless of the estimation error. But that’s only true for extremely simple queries; once we start combining tables, things change drastically.

We only have a single table right now, so let’s join it to itself to simulate what would happen if query complexity increased even slightly:


EXPLAIN
SELECT l1.*
  FROM sensor_log l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l1.location = 10
   AND l1.reading = 10
   AND l2.location = 10
   AND l2.reading = 10;

                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Hash Join  (cost=2746.16..5452.56 rows=1 width=28)
   Hash Cond: (l1.sensor_log_id = l2.sensor_log_id)
   ->  Bitmap Heap Scan on sensor_log l1  (cost=19.94..2714.09 rows=970 width=28)
         Recheck Cond: (location = 10)
         Filter: (reading = 10)
         ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
               Index Cond: (location = 10)
   ->  Hash  (cost=2714.09..2714.09 rows=970 width=4)
         ->  Bitmap Heap Scan on sensor_log l2  (cost=19.94..2714.09 rows=970 width=4)
               Recheck Cond: (location = 10)
               Filter: (reading = 10)
               ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
                     Index Cond: (location = 10)

We can ignore much of this output. Focus on the count estimates. We have roughly 1000 rows being hash joined with another 1000 rows. Hashing 2000 values isn’t especially intensive, and this just happens to be the optimal query plan. This is precisely what we want to see, and Postgres 10 makes it possible.

Things Fall Apart

So what would have happened in an older, somewhat less sophisticated version of Postgres? We can drop the statistics object we created and see for ourselves. Be prepared; this is a gruesome exhibition not suitable for some readers:


DROP STATISTICS stats_sensor_log_location_reading_1t1;

EXPLAIN
SELECT l1.*
  FROM sensor_log l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l1.location = 10
   AND l1.reading = 10
   AND l2.location = 10
   AND l2.reading = 10;

                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=20.13..2791.49 rows=1 width=28)
   ->  Bitmap Heap Scan on sensor_log l1  (cost=19.70..2713.85 rows=9 width=28)
         Recheck Cond: (location = 10)
         Filter: (reading = 10)
         ->  Bitmap Index Scan on idx_sensor_log_location  (cost=0.00..19.70 rows=970 width=0)
               Index Cond: (location = 10)
   ->  Index Scan using sensor_log_pkey on sensor_log l2  (cost=0.42..8.45 rows=1 width=4)
         Index Cond: (sensor_log_id = l1.sensor_log_id)
         Filter: ((location = 10) AND (reading = 10))

Deplorable. Now that Postgres only expects to match about 10 rows, it reverts to a nested loop to join the two tables together. Our particular example is only off by two orders of magnitude at relatively low scale, so the execution times don’t diverge too much. This particular query plan is “only” about 50-60% slower than the optimal version.

Yet queries get more complicated than this, to be sure. Add another predicate, up the scale slightly, and Postgres might use a nested loop in a situation where a merge join would have been more appropriate. Consider that merge joins are normally reserved for very large result sets, and try not to balk in abject terror after realizing a bad estimate here could mean looping through millions of results unintentionally. It’s the difference between a plan that might finish in less than a second, and one that may consume several hours.
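
When a suspicious nested loop like this shows up in the wild, one quick sanity check is to disable that join method for a single transaction and compare plans and timings. This is purely a diagnostic sketch; it isn’t something to leave enabled:


-- Diagnostic only: see which plan the planner picks when nested loops
-- are taken off the table, then compare against the original.
BEGIN;
SET LOCAL enable_nestloop = off;

EXPLAIN ANALYZE
SELECT l1.*
  FROM sensor_log l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l1.location = 10
   AND l1.reading = 10
   AND l2.location = 10
   AND l2.reading = 10;

ROLLBACK;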

Trouble in Paradise

Unfortunately, correlated statistics aren’t quite perfect yet. They’re a new and somewhat experimental piece of functionality, and there are some edge cases they trip over. In past articles, the location column has been a VARCHAR for mostly arbitrary reasons. Let’s switch it back and apply the same dependency statistics to see what the planner does:


CREATE TABLE sensor_log_vc (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log_vc (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_vc_location ON sensor_log_vc (location);
CREATE INDEX idx_sensor_log_vc_date ON sensor_log_vc (reading_date);

CREATE STATISTICS stats_sensor_log_vc_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log_vc;

ANALYZE sensor_log_vc;

EXPLAIN
SELECT *
  FROM sensor_log_vc
 WHERE location = '10'
   AND reading = 10;

                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log_vc  (cost=19.70..2623.99 rows=9 width=23)
   Recheck Cond: ((location)::text = '10'::text)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_vc_location  (cost=0.00..19.70 rows=970 width=0)
         Index Cond: ((location)::text = '10'::text)

What happened here? Notice how everything is being cast to TEXT by the planner? Normally this is quite innocuous, as TEXT and VARCHAR are roughly equivalent in Postgres. Evidently “roughly” does not mean “exactly”, at least to some of the backend code, because the row estimate has regressed to 9 despite the dependency statistics.
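
If altering the table is acceptable, one potential workaround, assuming nothing else depends on the VARCHAR declaration, is to convert the column in place and re-analyze. Consider this a hedged suggestion rather than a guaranteed fix:


-- VARCHAR and TEXT are binary compatible, so this conversion should not
-- rewrite the table; re-analyze so the statistics reflect the new type.
ALTER TABLE sensor_log_vc
  ALTER COLUMN location TYPE TEXT;

ANALYZE sensor_log_vc;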

If we try the experiment again, but use TEXT for the column type instead, we get the expected result:


CREATE TABLE sensor_log_txt (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       TEXT NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log_txt (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_txt_location ON sensor_log_txt (location);
CREATE INDEX idx_sensor_log_txt_date ON sensor_log_txt (reading_date);

CREATE STATISTICS stats_sensor_log_txt_location_reading_1t1
       (dependencies)
    ON location, reading
  FROM sensor_log_txt;

ANALYZE sensor_log_txt;

EXPLAIN
SELECT *
  FROM sensor_log_txt
 WHERE location = '10'
   AND reading = 10;

                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sensor_log_txt  (cost=19.93..2619.57 rows=968 width=23)
   Recheck Cond: (location = '10'::text)
   Filter: (reading = 10)
   ->  Bitmap Index Scan on idx_sensor_log_txt_location  (cost=0.00..19.68 rows=968 width=0)
         Index Cond: (location = '10'::text)

Hopefully this kind of thing will be ironed out before Postgres 10 is released to the public. If not, it’s still better than what we’ve been forced to endure thus far.

A Welcome Rest

It’s difficult not to be excited by Postgres 10. The development team has pulled out all the stops and is addressing long-standing concerns as if there were no tomorrow. On a more personal level, I’ve ranted and raved about this particular shortcoming on numerous occasions. I’ve had queries that should have executed in seconds literally take hours instead. In the past, the only way to fix these instances was to resort to CTEs or even multiple temp tables, thus forcing Postgres to materialize intermediate results so accurate row counts were available where they mattered.
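
For anyone who has been spared that experience, the temp table variant of the hack looked something like the sketch below; the names here are illustrative only. Materializing the correlated subset and analyzing it gave the planner a real row count to work with before the join:


-- Old-style workaround (sketch): materialize the correlated subset,
-- ANALYZE it, and join against the temp table instead.
CREATE TEMP TABLE tmp_sensor_subset AS
SELECT *
  FROM sensor_log
 WHERE location = 10
   AND reading = 10;

ANALYZE tmp_sensor_subset;

SELECT l1.*
  FROM tmp_sensor_subset l1
  JOIN sensor_log l2 USING (sensor_log_id)
 WHERE l2.location = 10
   AND l2.reading = 10;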

At best, that’s an ugly hack. At worst, it makes a temporary problem permanent barring frequent code refactors. Such workarounds may avoid the catastrophic result, but they’re almost never optimal. CTEs and temp tables aren’t mere optimization hints or fences; they’re optimization walls. Used in situations like this, they render improvements in the query planner irrelevant, since such walls were constructed expressly to thwart its efforts.

So now we’ve entered an era where bad practices aren’t perpetuated by the Postgres query planner itself. That’s an advancement of Earth-shattering proportions, the implications of which will continue to reverberate through the community in exciting and unexpected ways. There is indeed much about Postgres 10 that has me jumping in anticipation, but this… this is something special.

Well, at least from the perspective of an easily entertained and thoroughly boring DBA. Hopefully I’m not the only one that feels this way.
