PG Phriday: Uptown Func

June 30, 2017

One of the coolest things about Postgres functions is that they can return rows as if they were a table. Not only is it possible, but creating a set-returning function in Postgres is almost frighteningly trivial when compared to other database engines. In fact, some Oracle users are probably rolling their eyes and muttering “Just use a cursor!” already. Just hear us out!

Postgres functions can return cursors too, but using them afterwards isn’t exactly a friendly experience. Cursors are passed by reference, so the function must either accept a parameter to name the cursor, or it generates something wacky like “<unnamed portal 1>”. After that, cursors can only be used with FETCH instead of SELECT, greatly limiting their utility.

What about views, then? Obviously they can return rows like a table, so why not just write everything as a query and turn it into a view? It’s true that practically any function can be emulated with a sufficiently advanced query. This is especially true since Postgres 9 added CTEs, allowing it to generate step-by-step transformations. Some use cases aren’t well covered by that approach, however.

Stop, Wait a Minute

Consider the Fibonacci Sequence. Here’s a function that iteratively generates the whole sequence up to the Nth value:

CREATE OR REPLACE FUNCTION fibonacci(nLoops INT)
RETURNS SETOF NUMERIC AS
$$
DECLARE
  i INT;
  x NUMERIC := 0;
  y NUMERIC := 1;
  z NUMERIC := 1;
BEGIN
  FOR i IN 1..nLoops LOOP
    RETURN NEXT x;
    x = y;
    y = z;
    z = x + y;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT max(fibonacci) FROM fibonacci(100);

          max          
-----------------------
 218922995834555169026

Amusingly enough, this particular function works until the NUMERIC datatype itself is exhausted, provided sufficient hardware. Our test VM was able to survive until the 400,000th result before subsequent attempts exhausted our meager 8GB of temporary disk space. We’d print it here, but sadly a number with 83,595 digits won’t fit in the article.

Take a Sip, Sign a Check

Now consider if we had implemented this as a CTE instead. That by itself isn’t difficult, we just need to rewrite things a bit:

WITH RECURSIVE fib(i, a, b) as (
  SELECT 1, 0::NUMERIC, 1::NUMERIC
  UNION ALL
  SELECT i + 1, b, a + b
    FROM fib
   WHERE i = 100
)
SELECT max(a)
  FROM fib;

But how can this be turned into a general-purpose view? We would have to raise the limit of the recursive portion of the CTE to some comfortable (or uncomfortable) upper bound so the caller could limit the results with either a LIMIT or WHERE clause.

This has two problems. The first is that the maximum is arbitrary and may not correspond with desired use patterns. The second is that CTEs produce all results before filters are applied, meaning a lot of wasted effort. CTEs don't really work well in views since the inner components can't be tweaked for specific conditions, yet without them, SQL is stateless. There's also the matter of exception handling. So any time we need a state machine that returns rows, functions are really our only answer.

Don't Brag About it, Come Show Me

The primary limitation to Postgres functions is that the return type must exist before the function is declared. There are currently three ways to do this. Let's go back to our trusty sensor_log table:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

As usual, the table consists of 10-million rows, spread across 1000 locations with 100 possible reading values. In Postgres, every table also exists as a type which represents the table structure. We can see this by executing a completely non-recommended bit of SQL:

DROP TYPE sensor_log;

ERROR:  cannot drop type sensor_log because table sensor_log requires it
HINT:  You can drop table sensor_log instead.

Given the type already exists, we can write a function which returns rows that masquerade as the sensor_log table. Here's a very basic function that merely grabs the top ten readings for a particular location:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS SETOF sensor_log AS
$$
DECLARE
  rReturn sensor_log;
BEGIN
  FOR rReturn IN
    SELECT *
      FROM sensor_log
     WHERE location = nLocation
     ORDER BY reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT rReturn;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

SELECT * FROM sensor_location_top_ten('15');

 sensor_log_id | location | reading |    reading_date     
---------------+----------+---------+---------------------
       1068015 | 15       |     100 | 2017-02-22 09:17:30
       1255015 | 15       |     100 | 2017-01-31 17:50:50
        497015 | 15       |     100 | 2017-04-29 11:24:10
        894015 | 15       |     100 | 2017-03-14 12:37:30
        248015 | 15       |     100 | 2017-05-28 07:04:10
       1092015 | 15       |     100 | 2017-02-19 14:37:30
        183015 | 15       |     100 | 2017-06-04 19:37:30
         35015 | 15       |     100 | 2017-06-21 22:44:10
       2245015 | 15       |     100 | 2016-10-09 03:50:50
       1767015 | 15       |     100 | 2016-12-03 11:37:30
(10 rows)

Break it Down

But what if we want to permanently remove the location column from the results? We can't do that because there's no type with that definition. Of course, we can make one and then use it instead:

CREATE TYPE sensor_log_noloc AS (
  sensor_log_id  INT,
  reading        BIGINT,
  reading_date   TIMESTAMP
);

CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS SETOF sensor_log_noloc AS
$$
DECLARE
  rReturn sensor_log_noloc;
BEGIN
  FOR rReturn IN
    SELECT sensor_log_id, reading, reading_date
      FROM sensor_log
     WHERE location = nLocation
     ORDER BY reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT rReturn;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

Julio, Get the Stretch

Alternatively, Postgres functions can define specific OUT variables. Declared in order, we can use them as a virtual row-set. The major caveat here is that parameter names must not conflict with table column names due to ambiguity concerns. This means we should either rename our parameters, or be particularly meticulous when using table column aliases.

Here's how the function would look if we took the latter approach:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(
  nLocation          VARCHAR,
  sensor_log_id  OUT INT,
  reading        OUT BIGINT,
  reading_date   OUT TIMESTAMP
)
RETURNS SETOF RECORD AS
$$
BEGIN
  FOR sensor_log_id, reading, reading_date IN
    SELECT s.sensor_log_id, s.reading, s.reading_date
      FROM sensor_log s
     WHERE s.location = nLocation
     ORDER BY s.reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

It's a bit more work up front, but it removes the necessity of creating a corresponding type for every function that returns row sets.

Saturday Night, and We in the Spot

Of course, saving the best for last is a pretty classic move, and we're suckers for tradition. Beginning in Postgres 8.4, functions gained one final syntax that did two things that effectively deprecated the previous two approaches. First, we no longer have to declare the type before the function. Second, we can avoid polluting our argument list with return values.

Here's how it looks:

CREATE OR REPLACE FUNCTION sensor_location_top_ten(nLocation VARCHAR)
RETURNS TABLE(
  sensor_log_id INT,
  reading BIGINT,
  reading_date TIMESTAMP
) AS
$$
BEGIN
  FOR sensor_log_id, reading, reading_date IN
    SELECT s.sensor_log_id, s.reading, s.reading_date
      FROM sensor_log s
     WHERE s.location = nLocation
     ORDER BY s.reading DESC
     LIMIT 10
  LOOP
    RETURN NEXT;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

What's interesting about this approach is that it's otherwise identical to using OUT parameters. The principal differences are the previously mentioned reduction in parameter clutter, and that a table-returning function implicitly returns a set of records. In essence, it's basically just OUT-parameter shorthand someone hacked into the Postgres engine.

Don't Believe Me, Just Watch

Postgres functions have a plethora of such conveniences, and like most maturing platforms, it accumulates more with every passing generation. Since Postgres 10 is shaping up to be a rather comprehensive amalgam of varied enhancements, it's exciting to see how functions might evolve. Even if they remain static until some mysterious future version, there's still a lot more for us to explore.

And we most definitely will.

Share this

Relevant Blogs

Random Data

This post continues from my report on Random Numbers. I have begun working on a random data generator so I want to run some tests to see whether different random...
December 03, 2020

More Blogs

Full-text search since PostgreSQL 8.3

Welcome to the third – and last – part of this blog series, exploring how the PostgreSQL performance evolved over the years. The first part looked at OLTP workloads, represented...
November 05, 2020

Números aleatorios

He estado trabajando gradualmente en el desarrollo desde cero de herramientas para probar el rendimiento de los sistemas de bases de datos de código abierto. Uno de los componentes de...
November 04, 2020