On systems doing a non-trivial number of writes, tuning checkpoints is crucial for getting good performance. Yet checkpoints are one of the areas where we often see confusion and configuration issues, both on the community mailing lists and while doing support and consulting for our customers. This post explains what checkpoints are - their purpose and how the database implements them - and then how to tune them.
Note: This is a refreshed version of a blog post I originally wrote in 2016, updated to reflect various changes in PostgreSQL configuration. Otherwise the overall tuning approach remains mostly the same.
What is the point of checkpoints?
PostgreSQL is one of the databases relying on a write-ahead log (WAL) - before any change to a data file is performed, it is recorded in a separate log (a stream of changes). That provides durability, because in case of a crash the database may use the WAL to perform recovery - read the changes from the WAL and re-apply them to the data files.
At first this may seem inefficient, as it doubles the amount of writes - we're changing the data files and also writing the changes to the log. But it may actually improve performance, for a couple of reasons. The COMMIT only needs to wait for the WAL to be durable (written and flushed to disk), while the data files are modified only in memory (shared buffers) and maybe written to disk / flushed sometime later. That is good, because WAL is written sequentially, while writes to data files are often fairly random, so flushing WAL is much cheaper. Also, a data page may be modified multiple times while in shared buffers, and then persisted in a single physical write - another significant benefit.
Let's say the system crashed and the database needs to perform recovery. The simplest approach would be to start from scratch - start with a fresh instance and apply all WAL from the very beginning. In the end we should get a complete (and correct) database. The obvious disadvantage is of course the need to keep and replay all WAL since the instance was created. We often deal with databases that are not extremely big (say, a few hundred GBs), but very active and producing several TBs of WAL per day. Imagine how much disk space you would need to keep all WAL when running the database for a year, and how long a recovery would take. Clearly, this does not seem like a very practical approach.
But what if the database could guarantee that for a given WAL position (offset in the log, called "log sequence number" - LSN), all data file changes up to that position are flushed to disk? Then it could determine this position during recovery, and replay only the remaining part of the WAL (since this position). This would significantly reduce the recovery time, and it could also discard WAL before that "known good" position, because it's not needed for recovery.
That is exactly what checkpoints are for - ensuring that the WAL before some LSN is no longer needed for recovery, which reduces the disk space requirements and the recovery time. The database simply looks at the current WAL position, flushes all outstanding changes (that might need older WAL) to disk, and records that LSN in case a recovery is needed.
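As a quick illustration, you can see the position recorded by the most recent checkpoint directly from SQL - a minimal example for current releases (on PostgreSQL 9.6 the columns were named checkpoint_location and redo_location):

-- LSN of the last checkpoint record, and the REDO position
-- (the point from which crash recovery would start replaying WAL)
SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();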
The database triggers these checkpoints regularly, based either on time elapsed or on the amount of WAL generated since the previous checkpoint.
Note: If you happen to be a gamer, you are probably familiar with the concept of checkpoints - your character passes a certain point in the game, and if you fail to beat the next boss or fall into a lake of hot lava, you start from that last point instead of starting from the very beginning. Let's see how to achieve that in PostgreSQL ;-)
We'll discuss the configuration options influencing how often checkpoints happen later, but let's first briefly talk about the two extreme configurations.
We've already described one extreme - when checkpoints don't happen at all, or happen only very infrequently. This maximizes some of the benefits (combining data file changes into a single asynchronous write, and reducing the amount of writes to data files in general), but it also has the disadvantage of having to keep all the WAL and perform a lengthy recovery after a crash / unclean shutdown.
Let's also discuss the other extreme - doing very frequent checkpoints (say, every second or so). That might seem like a great idea, as it would allow keeping only a tiny amount of WAL and the recovery would be very fast too (having to replay only that tiny amount of WAL). But it would also turn the asynchronous writes to data files into synchronous ones, and make it much less likely that multiple changes combine into a single physical write. That would seriously impact the users (e.g. increasing COMMIT latency, reducing throughput).
So in practice this is a trade-off - we want checkpoints to happen infrequently enough not to affect users, but also frequently enough to limit the recovery duration and disk space requirements.
Triggering checkpoints
There are several ways a checkpoint can be triggered:
- executing the CHECKPOINT command directly
- executing a command that requires a checkpoint (e.g. pg_backup_start, CREATE DATABASE, or pg_ctl stop|restart and a few others)
- reaching a configured amount of time since the last checkpoint (checkpoint_timeout)
- generating a configured amount of WAL since the previous checkpoint (max_wal_size)
The first two points are irrelevant for this blog post - those are rare events, triggered manually, and associated mostly with maintenance operations. This post is about how to configure the other two cases, which affect regular periodic checkpoints.
Those time/size limits are set using two configuration options:
- checkpoint_timeout = 5min
- max_wal_size = 1GB (new option, introduced in PostgreSQL 9.5)
With these (default) values, PostgreSQL will trigger a CHECKPOINT every 5 minutes, or after writing about 1/2 GB of WAL, whichever happens first.
Note: The max_wal_size is a soft limit for total WAL size, which has two main consequences. Firstly, the database will try not to exceed the limit, but is allowed to, so keep enough free space on the partition and monitor it. Secondly, it's a limit on the total amount of WAL, not "per checkpoint" - because of spread checkpoints (explained later) the quota covers 1-2 checkpoints (or 2-3 before PG 11). So with max_wal_size = 1GB the database will start a CHECKPOINT after writing 500 - 1000 MB of WAL, depending on checkpoint_completion_target.
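As a rough back-of-the-envelope calculation (this approximates the internal heuristic on PG 11 and newer), the WAL written between two checkpoint starts is about:

max_wal_size / (1 + checkpoint_completion_target)
= 1GB / (1 + 0.9)
≈ 540MB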
The default values are rather conservative (i.e. low), like many other default values in the sample configuration file, sized to work even on small systems like a Raspberry Pi. Chances are you need to increase the limits significantly to get the most out of the available hardware.
But how to determine good values for your system / application? As mentioned, the goal is to do checkpoints neither too often nor too infrequently. Our tuning "best practice" approach consists of two steps:
- choose a "reasonable" checkpoint_timeout value
- set max_wal_size high enough to be rarely reached
What would be a "reasonable" value for checkpoint_timeout? Clearly, the more we accumulate since the last checkpoint, the more work we'll need to do during recovery, so it's related to the recovery time objective (RTO), i.e. how fast we want the recovery to complete (after a crash).
Note: If you're concerned about the recovery taking a long time after a crash, because the system needs to meet some high-availability / SLA requirements, you should probably consider setting up a replica instead of relying on checkpoints alone. Systems may require a lengthy reboot, replacement of a hardware part, and so on. Shorter checkpoints can't solve any of that, while failover to a replica is a proven solution.
This is a bit tricky, because checkpoint_timeout is a limit on how long it took to generate the WAL, not a limit on the recovery time. However, while the WAL is usually generated by multiple processes (backends running DML), recovery is performed by a single process - it is therefore limited to a single CPU, can stall on I/O, and so on. This affects not only local recovery, but also streaming replication, where the replica may not be able to keep up with the primary. Moreover, recovery may start with cold caches (e.g. right after a reboot), making the single recovery process particularly slow because of expensive I/O.
Note: PostgreSQL 15 introduced the recovery_prefetch option, which allows asynchronous prefetching of data during recovery, addressing this "single-process recovery" weakness and possibly making the recovery even faster than the original workload that generated the data.
Furthermore, instances may have very different workloads - an instance handling write-heavy workload will generate far more WAL than a read-only instance over the same time period, for example.
It should be clear that a longer checkpoint_timeout means more WAL and therefore longer recovery, but it's still not obvious what a "reasonable" timeout value is. Unfortunately, there's no simple formula that'd tell you the "optimal" value.
The default value (5 minutes) is, however, clearly too low, and production systems generally use values between 30 minutes and 1 hour. PostgreSQL 9.6 even increased the maximum to 1 day (so there clearly are senior hackers who think it might be a good idea for some systems). Low values may also result in write amplification due to full-page writes (a serious issue, but I'm not going to discuss it here for the sake of brevity).
The checkpoint_timeout parameter is a good example of "diminishing returns" - the benefits of increasing the value disappear fairly quickly. For example, doubling the timeout from 5 to 10 minutes can yield clear improvements. Doubling it again from 10 to 20 minutes may improve things further, but the improvement is likely much smaller (while the costs - the amount of WAL and the recovery time - still double). Going from 20 to 40 minutes, the costs double yet again, while the relative benefits are even less significant. It's up to you to decide the right trade-off.
Let's say we've decided to use 30 minutes, which in our experience is a reasonable value - not too low, not too high.
checkpoint_timeout = 30min
From now on, this is our target - we want checkpoints to happen every half an hour, on average. But if we do just this, the database may still trigger checkpoints more often because of hitting the limit on the amount of WAL between checkpoints - max_wal_size. And we don't want that - we aim for a checkpoint to be triggered only by checkpoint_timeout.
That means we need to estimate how much WAL the database produces over 30 minutes, so that we can use that for max_wal_size. There are several ways to calculate this estimate:
- Look at the actual WAL position (essentially an offset in the WAL) using pg_current_wal_lsn(), and compute the difference between two positions measured 30 minutes apart.
Note: The function used to be called pg_current_xlog_location() up to PG 9.6.
- Enable log_checkpoints=on and then extract the information from the server log (there will be detailed statistics for each completed checkpoint, including the amount of WAL).
- Use pg_stat_bgwriter, which includes information about the number of checkpoints (which you can combine with knowledge of the current max_wal_size value).
Let's use the first approach, for example. On my test machine running pgbench, I see this:
test=# select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
C7/72C140D8
(1 row)
... after 5 minutes ...
test=# select pg_current_wal_lsn();
pg_current_wal_lsn
--------------------
C7/E8A494CF
(1 row)
test=# SELECT pg_wal_lsn_diff('C7/E8A494CF', 'C7/72C140D8');
pg_wal_lsn_diff
-----------------
1977832439
(1 row)
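If you don't want to do the byte arithmetic yourself, the same difference can be formatted with pg_size_pretty():

test=# SELECT pg_size_pretty(pg_wal_lsn_diff('C7/E8A494CF', 'C7/72C140D8'));
 pg_size_pretty
----------------
 1886 MB
(1 row)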
This shows that over those 5 minutes the database generated ~1.8GB of WAL, so for checkpoint_timeout = 30min that would be about 6 * 1.8GB = 11GB of WAL. However, as mentioned before, max_wal_size is a quota for 1 - 2 checkpoints, so we need to set it to about 22GB.
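In postgresql.conf that translates to something like this (a value derived from this particular example, not a general recommendation):

max_wal_size = 22GB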
Note: The number of checkpoints covered by max_wal_size depends on the checkpoint_completion_target value, discussed later in the section about spread checkpoints. It's safe to assume the higher value matters. Also, the quota used to cover 2 - 3 checkpoints until PG 11.
The other approaches collect the data differently, but the overall idea is the same and results should be comparable.
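For completeness, a minimal sketch of the pg_stat_bgwriter approach (assuming a release where the checkpoint counters still live in that view) - sample the counters, wait a while, sample them again, and compare:

-- checkpoints triggered by checkpoint_timeout vs. requested ones (e.g. due to max_wal_size)
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;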
Estimates only
Those measurements are inherently inaccurate, because the frequency of checkpoints affects how many full-page writes we need to write, and full-page writes may be a major source of WAL write amplification.
The first change to each data page (8kB chunk of data in a data file) after a checkpoint triggers a full-page write, i.e. writing the full 8kB page into WAL. If you change a page twice (e.g. because of two UPDATEs from different transactions), this may result in one or two full-page writes, depending on whether a checkpoint happened between them. This means that doing checkpoints less frequently may eliminate some of the full-page writes, reducing the amount of WAL written. So don't be surprised when the amount of WAL after making checkpoints less frequent is significantly lower than the sum of WAL produced by the more frequent checkpoints.
For example, assume checkpoints are happening every 10 minutes, and we write 10GB of WAL per checkpoint. If you want checkpoints to happen every 30 minutes, you may set
max_wal_size = 60GB
to cover 2 checkpoints, 30GB each, and that should do the trick. But if you then measure the amount of WAL generated per checkpoint, you may find it's only ~15GB of WAL on average (for example), instead of the expected 30GB. It's likely due to a lower number of full-page writes, and ultimately it's a good thing - fewer full-page writes, lower WAL amplification, less WAL to archive and/or replicate, etc.
This means less frequent checkpoints may significantly reduce the amount of WAL you need to retain per your backup policy. It's typical to keep enough backups + WAL to perform PITR for the last month, i.e. you need to keep WAL for the last 30 days or so. If making checkpoints less frequent reduces the amount of full-page writes, the size of the WAL archive will reduce too.
Spread checkpoints
When I suggested that you only need to tune checkpoint_timeout and max_wal_size, I wasn't telling the whole truth. Those two parameters are certainly the two important ones, but there's a third one, called checkpoint_completion_target.
You probably don't need to tune it - the default is fairly reasonable, and if you tune the first two parameters, it'll be fine. But it's good to understand what a "spread checkpoint" is, so that you can tune it if needed.
During a CHECKPOINT, the database needs to perform these three basic steps:
- identify all dirty (modified) blocks in shared buffers
- write all those buffers to disk (or rather to filesystem cache)
- fsync() all the modified files to disk
Only when all those steps finish can the checkpoint be considered complete - ultimately, it's the final "fsync" step that makes things durable.
You could do these steps "as fast as possible", i.e. write all the dirty buffers in one go and then call fsync() on all the affected files - and in fact that's exactly what PostgreSQL did until version 8.2. But that leads to major I/O stalls, filling the filesystem cache, saturating the devices, and affecting user sessions. Imagine you have 8GB of modified data in shared buffers, and you write all of it to the OS at once and call fsync. The kernel will do all it can to satisfy the request as fast as possible, but that means all the other processes will have to wait for their I/O much longer. The checkpoint is initiated in the background, but its impact on user sessions is massive.
To address this, PostgreSQL 8.3 introduced the concept of "spread checkpoints" - instead of writing all the data at once, the writes are spread over a long period of time. That gives the OS time to flush the dirty data in the background, making the final fsync() much cheaper, and reducing the impact on user sessions.
The writes are throttled based on progress towards the next checkpoint - the database knows how much time / WAL is left until the next checkpoint will be needed, so it can calculate how many dirty buffers should already have been written out (to the OS) at any given point in time.
The database must not keep issuing writes until the very last moment, however - that would mean the last batch of writes would still be in the filesystem cache, making the final fsync() calls (issued right before starting the next checkpoint) expensive again. Probably not as expensive as before (because the amount of dirty data would be lower), but if we can prevent that, why not?
The checkpoint_completion_target = 0.9 (the default was 0.5 before PG 14) says how far towards the next checkpoint all the writes should be complete. For example, assuming checkpoints are triggered every 30 minutes, the database will throttle the writes so that the last write is done after 30 * 0.9 = 27 minutes. This means the OS then has another 3 minutes to flush the data to disk, so that the fsync calls issued at the end of the checkpoint are cheap and fast.
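In postgresql.conf terms, the spread-checkpoint part of this example might look like this (a sketch, assuming the 30-minute timeout chosen earlier):

checkpoint_timeout = 30min
checkpoint_completion_target = 0.9   # finish writes at ~27 minutes, leaving ~3 minutes for the OS flush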
Until PostgreSQL 9.6 we would also need to worry about how quickly the kernel writes out dirty data from the page cache, which is governed by vm.dirty_expire_centisecs (30 seconds by default) and vm.dirty_background_bytes (the amount of dirty data at which the kernel starts writing it out). And that would need to be considered when deciding how much time to leave at the end of the checkpoint for the OS to write out the data.
PostgreSQL 9.6 however introduced the checkpoint_flush_after option, which makes tuning these kernel parameters mostly unnecessary - it instructs the database to ask the OS to flush data to disk regularly during a CHECKPOINT, after writing a small amount of data (256kB by default). This greatly reduces the impact of the final fsync calls on other processes (no major I/O stalls). It also means we don't need to worry too much about the time left at the end of the checkpoint, because the amount of unflushed data stays small.
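The corresponding setting is usually fine at its default (256kB on Linux), so this is mostly for illustration:

checkpoint_flush_after = 256kB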
Summary
So now you should know the purpose of checkpoints, and also the basics of tuning them. To sum this up:
- most checkpoints should be triggered by time (checkpoint_timeout)
- the interval is a compromise between throughput (infrequent checkpoints) and the time needed for recovery (frequent checkpoints)
- most production systems use a timeout between 30 and 60 minutes; pick a value in this range unless you have data to support a different choice
- after deciding on the timeout, choose max_wal_size by estimating the amount of WAL
- set checkpoint_completion_target to 0.9 (if on an older version, where the default was 0.5)
- on older versions (before 9.6) without checkpoint_flush_after you may need to tune the kernel parameters (vm.dirty_expire_centisecs and vm.dirty_background_bytes)
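Putting the values from the example in this post together, the resulting configuration sketch might look like this (illustrative values derived from the example workload above, not universal recommendations):

checkpoint_timeout = 30min
max_wal_size = 22GB
checkpoint_completion_target = 0.9
log_checkpoints = on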