Tuning max_wal_size in PostgreSQL

March 17, 2023

My current focus at work is automatic tuning of PostgreSQL, which as I'm sure you can imagine involves a lot of benchmarking and challenging our existing knowledge of how to manually tune PostgreSQL.

One of the biggest surprises to a number of us was the effect of tuning the max_wal_size parameter. Our testing with HammerDB's TPROC-C benchmark has shown that on systems with magnetic disks, properly tuning this GUC is second only to proper sizing of shared_buffers when it comes to performance optimization; tuning of shared_buffers shows a 3.5x performance increase over stock PostgreSQL, and tuning of max_wal_size offers a further ~1.5x increase on our hardware. In one test on ephemeral storage in the cloud we actually saw a 10x performance increase after tuning max_wal_size!

A little background info

It helps to understand a little about how PostgreSQL processes transactions to explain this parameter and what effect it has on performance. The following explanation assumes a standard production configuration, and is of course, somewhat simplified.

When changes are made to data in PostgreSQL, those changes are written to the Write Ahead Log (WAL), and only once that has been synced to disk is the transaction considered durable – the D in ACID. At this point the server will inform the client that the transaction is committed. At some point in the future a checkpoint will be run during which the changes will also be written back to the database itself, or the heap as it's often known. 

If there is a server crash before the changes are written to the heap, at restart the changes recorded in the WAL will be replayed from the point of the last checkpoint to bring the heap up to date with the most recent committed transactions – at which point normal operations can resume.

So, assuming that the WAL is not also being used for replication or log shipping purposes, we always need to store enough WAL segments to record all the changes since the last checkpoint (in actuality 2-3 checkpoints worth are usually stored, but that's not important here). 

So what is max_wal_size?

The max_wal_size parameter tells PostgreSQL what the approximate maximum total size of the stored WAL segments should be. Once the allowed space is exhausted, a checkpoint is requested so that the space can be recycled.

On a well configured system, the vast majority of checkpoints should be timed (based on the checkpoint_timeout parameter) rather than requested. This ensures that checkpoints happen on a regular, predictable schedule, allowing the load to be evenly spread throughout the normal operation of the system. Requested checkpoints are inherently unpredictable and thus can cause variations in performance by adding additional load when it is not expected. This is particularly disruptive on systems with slow I/O, such as those with magnetic drives or virtual machines with limits on IOPs.

Tuning

When tuning max_wal_size, the aim is to ensure that the vast majority of checkpoints are timed. Occasional requested checkpoints are quite normal in response to unexpected bursts of activity, but they should be the exception not the norm.

Our aim when tuning is to keep the ratio of timed to requested checkpoints as high as possible, whilst ensuring that we do not use more disk space for the WAL than we can afford. In an ideal situation there will be enough disk space available that we can set max_wal_size to a very large number to almost completely eliminate requested checkpoints. On high throughput systems this number might be in the tens or hundreds of gigabytes range.

It is worth noting that as more WAL is generated between checkpoints, the recovery time following an unclean shutdown increases, as does the amount of I/O that each checkpoint requires. Effects of this can be mitigated by tuning checkpoint_timeout to a lower value (we typically recommend 15 minutes as a starting point) so that timed checkpoints occur more frequently, and ensuring that checkpoint_completion_target is set to a value such as 0.9 to spread the work over as much of the timed checkpoint interval as possible. On the other hand, assuming full_page_writes is enabled – which it should be – the first time a page is written to WAL following a checkpoint the whole page is written regardless of how much of it actually changed, and as that clearly requires more I/O we don't want checkpoints to happen too frequently!

Monitoring

PostgreSQL includes a view called pg_stat_bgwriter that can be used to monitor the number of requested and timed checkpoints (the checkpoints_req and checkpoints_timed columns respectively). It is well worth monitoring those values, along with the available space on the partition used to store the WAL, to ensure that the ratio of requested to timed checkpoints is kept at an optimal level given the disk space available.

The output below shows the worst possible situation! This is from an untuned system that has just been started for the first time and is performing a bulk data load:

tprocc=# SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
 checkpoints_timed | checkpoints_req 
-------------------+-----------------
                 0 |             188
(1 row)

Conclusion

PostgreSQL's max_wal_size configuration parameter can have a profound effect on performance with workloads that update the database. Tuning it is very system and workload specific, and requires finding the optimal balance between the disk space available for storing WAL, whilst maximizing the ratio of timed to requested checkpoints.

Share this