Query Plan Improvement with Expression Indexes in PostgreSQL

Most people know that Postgres allows the creation of indexes on expressions. This is helpful if you need index lookups of expressions used in where clauses.

However, there is another benefit to expression indexes, and that is optimizer statistics. Not only do expression indexes allow rapid lookups of matching expressions, but they also provide optimizer statistics, which improve row estimates and hence query plans.

Here is an example:


        CREATE TABLE test (x INTEGER);
        INSERT INTO test SELECT x FROM generate_series(1, 100) AS t(x);
        
        ANALYZE test;
        
        SELECT COUNT(*) FROM test WHERE x % 2 = 1;
         count
        -------
            50
        
        EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                             QUERY PLAN
        ----------------------------------------------------
         Seq Scan on test  (cost=0.00..2.50 rows=1 width=4)
           Filter: ((x % 2) = 1)

The optimizer doesn't know the selectivity of the modulus operator, so it initially assumes only one row is returned. Once an expression index is created and analyze statistics generated, the optimizer knows exactly how many rows will be returned:


        CREATE INDEX i_test ON test ((x % 2));
        
        ANALYZE test;
        
        EXPLAIN SELECT * FROM test WHERE x % 2 = 1;
                             QUERY PLAN
        -----------------------------------------------------
         Seq Scan on test  (cost=0.00..2.50 rows=50 width=4)
           Filter: ((x % 2) = 1)

Interestingly, the optimizer used expression index statistics, even though the expression index itself was not used. In the example above, the modulus operator is not selective enough to make the index useful, but expression statistics would be useful for more complex queries, e.g. with joins. This method can also be used to create statistics on functions.

It is also possible to create an expression index that generates cross-columns statistics, the benefits of which I mentioned in an earlier blog post. For example, this expression index would supply accurate statistics for state/city combinations, but queries would need to use the exact concatenation construction:


        CREATE INDEX i_customer_state_city ON customer ((state || '|' || city));

It would be nice if there was a way to create expression statistics without the overhead of creating and maintaining indexes.

This post originally appeared on Bruce's personal blog.