PG Phriday: 10 Things Postgres Could Improve - Part 2

June 19, 2020

Last week, we examined Postgres XID wraparound complications in greater depth to see exactly how easily they can surprise even prepared enterprises. We also identified the real problem areas and how to mitigate them.

In this continuing series to plumb the depths of Postgres constructively, we’re going to focus on some of the ways Postgres replication may either work non-intuitively or result in complications. And just like last time, we’ll discuss how to address each situation so we can get the most out of our Postgres database cluster.

Let’s get started.

Framing Postgres Replication

Back in the Bad Old Days before 9.0, Postgres didn’t have replication at all. The only thing that existed was continuous crash recovery, accomplished by applying WAL files shipped from an upstream Primary node.

In many ways, even the most current Postgres 12 version still reflects that model. All write activity is first written to the WAL, which replicas can now stream directly from the Primary node. Some under-the-hood modifications also made it possible to read from instances that are in the middle of WAL apply, but it is still very much a continuous crash-recovery apply mechanism.

This has some implications, both fortunate and less so.

Hey, You Missed a Spot!

The default Postgres replication model is an asynchronous one, meaning that any data not received by a replica may be permanently lost. Some have suggested that synchronous replication is a solution to this, but that is only the case provided at least two replicas are present.

Why exactly is that? Try this experiment:
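First, a synchronous standby actually has to be configured. Here’s a minimal sketch for following along, assuming a single standby that connects with an application_name of replica1 (the name itself is arbitrary):

-- Hypothetical setup: one synchronous standby named 'replica1'
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
SELECT pg_reload_conf();

-- Verify the standby is attached and marked as synchronous
SELECT application_name, sync_state FROM pg_stat_replication;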


-- On the Primary node

CREATE TABLE foo ( id int );

-- Stop the synchronous standby. Then:

BEGIN;
INSERT INTO foo VALUES (1);
COMMIT;

The COMMIT should hang at this point. Now cancel the transaction and select from the table. The supposedly canceled transaction is alive and well, despite never having been acknowledged by the synchronous standby. Even though the Postgres client refuses further action until that transaction is acknowledged, nothing stopped the commit itself from occurring locally.
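In psql, the aftermath looks something like this:

-- Cancelling the hung COMMIT (Ctrl+C) produces a warning that the
-- transaction "has already committed locally", and indeed:

SELECT * FROM foo;

--  id
-- ----
--   1
-- (1 row)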

This is actually the expected behavior since Postgres uses WAL streaming for replication. In order for the transaction to be replayed on a replica, it must first exist in the stream. Ostensibly then, the transaction COMMIT record must also appear in the WAL before being consumed by the standby, even if it is synchronous.

Additionally, Postgres does not use a quorum during any of this. There is configuration syntax to demand that multiple servers acknowledge a transaction, but that still doesn’t prevent it from being committed locally. It only prevents further transactions from that particular client session.
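For reference, the quorum-style syntax looks like this, again with arbitrary standby names. It changes how many acknowledgments the client waits for, not whether the local commit happens:

-- Any two of the three listed standbys must acknowledge the flush,
-- but the COMMIT record is still written and applied locally first.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)';
SELECT pg_reload_conf();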

As a result, the only safe way to use synchronous replication is to deploy at least two such replicas, and then hope only one goes down at a time. Otherwise, it is generally not possible to prevent lost data; if the Primary irrecoverably crashes while the Standby is also down, any transactions committed while the Standby was unreachable also vanish into the ether.

Ebb and Flow

Surely the window for actual data loss is small in that edge case: at most one pending transaction per connected session may have been committed only on the lost instance. That’s no real consolation in some contexts, however, and Postgres has no real answer for this particular scenario.

The easiest way to address this deficiency is to always maintain at least one more active synchronous stream than the design strictly requires. If the design model calls for a minimum of two synchronous streams, use three.

The additional streaming node in this case should always be a backup. Postgres itself ships with the pg_receivewal utility which also supports synchronous mode via the --synchronous argument. It’s important to create a replication slot and keep a synchronous stream running to avoid any gaps in the WAL stream. Files received this way should be archived long-term, and are then available for PITR purposes.
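As a rough sketch, assuming a slot named archive_wal and an archive directory of our choosing:

-- On the Primary: a dedicated physical slot, reserving WAL immediately
SELECT pg_create_physical_replication_slot('archive_wal', true);

-- Then keep a synchronous receiver attached to that slot, for example:
--   pg_receivewal -D /archive/wal --slot=archive_wal --synchronous \
--                 -d "host=primary-host user=streaming_user"
-- The receiver's application_name must also appear in
-- synchronous_standby_names for its acknowledgments to count.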

Given the coordination necessary here, we designed Barman to perform these tasks in addition to managing backups. A sufficiently configured Barman server will not only keep pg_receivewal running, but also create a replication slot in cases of failover to a different node, and catalog the files themselves for easy retrieval during a recovery. Barman also uses Postgres streaming replication, so it is one of the only Postgres backup tools that could possibly reach RPO-zero: the concept that no data is ever lost during a system outage.

In that case, we wouldn’t need two fully qualified synchronous Postgres replicas in a cluster. We could fulfill the same write guarantee with one regular database instance and a Barman server, which we may need anyway if we want to follow best practices for disaster recovery.

Drinking From the Fire Hose

Since Postgres replication is WAL-based, everything written to the WAL is also written to the replication stream. This includes any ad-hoc maintenance activity such as VACUUM FULL, CLUSTER, or REINDEX, among others. These produce a burst of activity that must be replayed across every other physical replica.

It also means any data corruption on the Primary node always reaches those same replicas. But this is no surprise. Postgres physical replicas are essentially backups that are in continuous WAL recovery and also happen to remain online for read access. Except for hint bits, streaming replicas are byte-for-byte exact copies of the primary node. And if the wal_log_hints setting is enabled, or the database was created with checksums enabled, even those minutiae join the fire-hose.
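To get a feel for the size of one of those bursts, we can sample the WAL insert position before and after a maintenance command. This is just a sketch with a hypothetical table and LSN value:

-- Note the current WAL position, run the maintenance command, then compare
SELECT pg_current_wal_lsn();    -- suppose this returns 0/5A000060
VACUUM FULL some_big_table;
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/5A000060'));

Every byte reported there must be shipped to, and replayed by, every physical replica in the cluster.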

Taking a Sip Instead

Indeed, logical replication can address both of those particular physical replication shortcomings. The data tuple or statement itself is replicated rather than all of the related page modifications. This Provider->Subscriber model decouples the binary dependence between the two nodes, and even makes it possible to mix otherwise incompatible data backends. Logical replication is the key to in-place zero-downtime major-version upgrades. What’s not to love?

There is, unfortunately, one rather conspicuous caveat to those statements. Postgres logical replication requires the use of replication slots to carefully track the LSN replay position of the subscribing system. However, despite being introduced back in Postgres 9.4, slots are still not copied to physical replicas, nor retained in a backup. This means that if a Standby replaces a failed Primary, the last replicated position for every logical subscriber is lost.

The subscriber itself retains the last valid origin information in the pg_replication_origin_status view. This means we could recreate the slot and allow the subscriber to continue streaming changes.
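As a sketch, assuming a subscription (and therefore a slot) named my_subscription:

-- On the subscriber: the last origin position that was applied
SELECT external_id, remote_lsn FROM pg_replication_origin_status;

-- On the newly promoted Primary: recreate the slot the subscription expects
SELECT pg_create_logical_replication_slot('my_subscription', 'pgoutput');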

Maybe.

Slots work by preventing WAL recycling. If enough transaction volume pushes the LSN beyond the position required by the subscriber before it can reconnect, replication can never resume because it’s not possible to create slots for past WAL positions. This makes sense, because those past WAL files were likely recycled once they were no longer required.

This also reveals another somewhat uncomfortable fact about replication slots: they can’t be rewound even if the required WAL files have not been recycled. Get a connection string wrong and accidentally fast-forward a slot on the wrong server with something like wal2json? Too bad.
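The SQL interface makes that mistake surprisingly easy. With a hypothetical slot named audit_slot, the difference is a single word:

-- Peeking returns pending changes but leaves the slot position untouched
SELECT * FROM pg_logical_slot_peek_changes('audit_slot', NULL, NULL);

-- "Get" consumes them and permanently advances the slot; there is no
-- function to move it backwards again
SELECT * FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);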

While logical replication provides Postgres with some powerful capabilities such as online upgrades, it’s not yet a real replacement for physical replication.

Plugging Holes in the Dam

So what is?

Physical replicas can revert to manually processing WAL files rather than streaming them from some origin system. Logical replicas do not have this option.

Why not? Postgres logical WAL replication works by using the origin Postgres instance to decode the WAL contents via the local system catalog. This converts table and schema identities into their fully qualified equivalents: rather than 3249.98791, we see a table named foo.bar. This is necessary because the Postgres catalog follows the same transaction rules as anything else. Hence if we rename foo.bar to foo.bun, the decoded name reflects the catalog as it existed when that portion of the WAL was written.

Obviously a subscriber system can’t do this; it has no idea which exact transaction may have modified a table’s name or column structure. Since decoding can only occur on an online origin node, merely having the related WAL files available to the subscriber system is insufficient.
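The decoded stream illustrates the point. Using the contrib test_decoding plugin and a scratch slot against the foo table from earlier (and assuming wal_level = logical), the output already contains catalog-resolved names rather than OIDs; it should look something like this:

SELECT pg_create_logical_replication_slot('scratch_slot', 'test_decoding');
INSERT INTO foo VALUES (2);
SELECT data FROM pg_logical_slot_peek_changes('scratch_slot', NULL, NULL);

--  BEGIN 1234
--  table public.foo: INSERT: id[integer]:2
--  COMMIT 1234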

In this case, we can do a couple of things to save ourselves. While replication slots are ostensibly a replacement for wal_keep_segments, this isn’t true in practice thanks to slots not being reflected in backups or Standby nodes.

Instead, we can follow the technique I outlined in Terrific Throughput Tracking and actually calculate our WAL production rate. If our failover process usually takes one or two minutes, we want to account for up to 10 times that much.

For example, if we normally produce 320MB of WAL traffic in two minutes (20 WAL files), we would want to set wal_keep_segments to 200. Thus our automated systems would have up to 20 minutes to recreate the replication slot before it would become necessary to rebuild any tables dependent on logical replication.
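A rough sketch of that measurement and the resulting setting, using psql and a two-minute sample window:

-- Sample the WAL insert position, wait, and see how much was produced
SELECT pg_current_wal_lsn() AS start_lsn \gset
SELECT pg_sleep(120);
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn'))
       AS wal_per_two_minutes;

-- 320MB in two minutes is 20 16MB segments; pad by 10x for headroom
ALTER SYSTEM SET wal_keep_segments = 200;
SELECT pg_reload_conf();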

Sealing the Last Leak

What Postgres really needs to improve its logical replication situation is a minimum of two things:

  1. A mechanism for relaying replication slot information to Standby nodes.
  2. Some way of adding extra WAL content such that object names exist in decoded form within the WAL file itself, or as a supplementary transaction-pinned decode mapping.

So long as logical replication must depend on an online, living and breathing node, using information that literally cannot be backed up, it will remain essentially a novelty.

2ndQuadrant has addressed item 1 in our pglogical 3 extension. A pglogical 3 background worker can be configured to regularly copy slot positions from the upstream Primary for specified replication slots. Unfortunately this functionality is only available to commercial customers, and doesn’t solve the underlying problem for users of any community version of Postgres.

The second issue is something that has yet to be approached. WAL must be decoded by the node that produced it, yet slots only move forward. Combined with the fact that a logical subscriber can’t decode the files for itself, the only remaining option is to rebuild the logical subscriber entirely. Depending on database sizes, that can be extremely resource and time intensive.

This is a weakness in Postgres that hearkens back to the ancient days before pg_upgrade made it possible to upgrade a Postgres database without performing a full dump/restore cycle. The thankless, resource-intensive, and time-consuming task of rebuilding a logical replica is currently taken for granted. Our users are not likely to be very forgiving of that drawback, especially if forced to repeat the process on a multi-TB system.

We can prepare for all of this, and work around much of it to a certain extent, but Postgres can do better.
