One of the interesting features in PostgreSQL since version 9.4 is the ability to control removal of WAL files using replication slots. The dark side is that replication slots can cause disks to fill up with old WAL, killing the main production server. In this article I explain PostgreSQL replication slots, and how a new feature in PostgreSQL 13 helps prevent this problem.
WAL Production
As you know, WAL is produced for database changes in a primary server: inserts, updates, et cetera. A more active database will produce more WAL — in a very active server, there can be many gigabytes of WAL produced every minute. WAL is written to files with names in an increasing numerical sequence, and the files are always the same size (16 MB is the default and typical size). Once the data in a file is no longer needed, that file can be recycled: renamed to a higher-numbered position in the sequence so that it can be filled with new data later.
(There are special situations such as a surge in activity that leads to the creation of additional files; when later the surge dies down, those extra files are removed instead of recycled.)
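As a quick illustration (just a sketch; function names as in recent releases), you can ask the server which position and WAL file it is currently writing:
SELECT pg_current_wal_lsn() AS current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;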
Because all database write activity produces WAL, it is critical that disk space be available for it. When the disk storing WAL is full, the server will be unable to process new transactions and can become stuck, or worse: it may fall over completely. This is a situation to be avoided by all means possible.
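If you want a rough idea of how much disk WAL is using right now, something along these lines works (note that pg_ls_waldir() is restricted to superusers and members of pg_monitor by default):
-- total size and count of files currently in pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_on_disk, count(*) AS wal_files
FROM pg_ls_waldir();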
Replication Slots
Replication in PostgreSQL works by processing WAL files. In order for this to work, all WAL files have to be temporarily available until they are processed. Therefore a mechanism is needed to tell the WAL management machinery not to recycle or remove files that are still needed.
Enter replication slots. Slots are a mechanism that indicates that this backup we’re taking will require that WAL file, and can you please not delete it yet; or this replica still hasn’t processed that WAL file, so can it please be left alone for a little while.
By themselves, replication slots occupy very little disk space. They just store a tiny bit of metadata, including a pointer to a position in WAL. But the WAL data they protect is another matter: in a highly active server, it can be measured in gigabytes or worse.
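For illustration only (the slot names and the test_decoding output plugin here are arbitrary examples), creating and inspecting slots looks like this:
-- a physical slot, as used by streaming replication or pg_basebackup
SELECT pg_create_physical_replication_slot('standby1_slot');
-- a logical slot; test_decoding is a demo output plugin shipped in contrib
-- (logical slots require wal_level = logical)
SELECT pg_create_logical_replication_slot('audit_slot', 'test_decoding');
-- the metadata itself is tiny
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_catalog.pg_replication_slots;
-- drop a slot that is no longer needed
SELECT pg_drop_replication_slot('audit_slot');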
WAL Consumption
Feeding data to a physical replica means copying the WAL data from its primary server. Similarly, a logical replica needs to read WAL data (and transmit an interpreted version to the replica). The WAL position being read is what the slot keeps track of. Once the replica has secured the WAL data somehow, the slot can be advanced; this tells WAL management in the primary that the WAL file is then available for removal. This happens continuously while the replica is active, so WAL in the primary server will occupy roughly the same amount of disk space, or maybe just a little more. Even twice as much, or ten times as much, might be acceptable, depending on conditions.
The problem is that if a replica dies completely and doesn’t recover for a long period of time; or the replica is destroyed and the DBA forgets to remove the replication slot; or the slot is a forgotten leftover of some experiment; or even the replica is being fed over a slow network link, then the reserved WAL will grow without bounds. And that becomes a ticking bomb.
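To see how much WAL each slot is currently holding back, one way (a sketch, comparing the current write position with each slot's restart_lsn) is:
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_catalog.pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;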
Limiting Slot Size
In order to fight this problem, Kyotaro Horiguchi had been working since February 2017 on a PostgreSQL patch to limit the size of WAL reserved by a slot. After a very long review and rework process, I integrated it into PostgreSQL 13, improving the management of high-availability PostgreSQL farms.
The main principle is that it is better to kill a replica (by somehow making its slot invalid; more on that below) than to kill the primary server that feeds it and take all of production down with it.
The way it works is pretty straightforward: set max_slot_wal_keep_size (documentation) in postgresql.conf to the maximum amount of WAL disk space that replication slots are allowed to reserve. If a slot reaches that point and a checkpoint occurs, that slot will be marked invalid, and some WAL files may be deleted. If the slot was in active use by a walsender process, that process will be signalled so that it terminates. If the walsender starts again, it will find that the WAL files it needs are no longer there. The replica using that slot will have to be recloned.
If max_slot_wal_keep_size is -1, which is the default value, then there is no limit. I don't recommend this, because it leads to failures when slots fill the disk.
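For example, to cap slot-reserved WAL at 10 GB (a value picked purely for illustration; choose one that fits your disk and your replicas), you can edit postgresql.conf or use ALTER SYSTEM; a reload is enough, no restart needed:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
SHOW max_slot_wal_keep_size;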
Monitoring Slot Health
Also included are some monitoring features. Two columns in pg_replication_slots are relevant. The most critical one is wal_status. If that column is reserved, then the slot is pointing to data within max_wal_size; if it is extended, then it exceeded max_wal_size but is still protected by either wal_keep_size or max_slot_wal_keep_size (including when max_slot_wal_keep_size is -1, that is, unlimited). Either state is good and normal. However, when a slot gets over the limit, it first becomes unreserved, which means it is in imminent danger, but can still recover if it is quick enough. Finally, the status becomes lost when WAL files have been removed and no recovery is possible.
The other column is safe_wal_size: it shows the number of bytes of WAL that can be written before this slot gets in danger of having WAL files removed. We suggest keeping a close eye on this column in your monitoring system, and firing alerts when it gets low. Zero or negative means your replica will be dead as soon as a checkpoint occurs:
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_catalog.pg_replication_slots;
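A slightly more targeted variant for alerting, assuming a headroom threshold of 1 GB that you would tune to your own environment, could be:
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_catalog.pg_replication_slots
WHERE wal_status IN ('unreserved', 'lost')
   OR safe_wal_size < 1024 * 1024 * 1024;  -- less than ~1 GB of headroom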
We believe that this new feature makes maintenance of replicas easier and more robust; hopefully we won’t see any more disasters with production down because of these problems.
(A note: safe_wal_size was introduced in 13beta3, so be sure to consult up-to-date documentation, or you'll see min_safe_lsn instead. Ignore that.)
Thanks
Special thanks to Kyotaro Horiguchi for working on solving this problem. Several reviewers got deep on this, among whom I’d like to thank especially Masahiko Sawada, Fujii Masao, Jehan-Guillaume de Rorthais, and Amit Kapila (in no particular order).