A few weeks ago I covered the basics of tuning checkpoints, and in that post I also mentioned autovacuum as the second common source of performance issues (based on what we see on the mailing list and at our customers). Let me follow up on that with this post about how to tune autovacuum, to minimize the risk of performance issues. I'll briefly explain why we even need autovacuum (dead rows, bloat, and how autovacuum deals with it), and then move to the main focus of this blog post - tuning. I'll go over all the relevant configuration options, and some basic rules for tuning them.
Note: This is a refreshed version of a blog post I originally wrote in 2016, updated to reflect various changes in PostgreSQL configuration. Otherwise the overall tuning approach remains mostly the same.
What are dead rows?
Before we start talking about tuning autovacuum, we first need to understand what “dead rows” are and why the cleanup performed by autovacuum is actually needed …
Note: Some resources say “dead tuples” instead of “dead rows” - it’s pretty much just a different name for the same thing. The term “tuple” comes from relational algebra (the foundation of relational databases) and is therefore more abstract, while the meaning of “row” is usually closer to the implementation. But those differences are mostly ignored, and the terms are used interchangeably. I’ll stick to “rows” in this post.
When you do a DELETE in Postgres, the row (aka tuple) is not immediately removed from the data file. Instead it is only marked as deleted by setting an xmax field in its header. Similarly for UPDATE, which in Postgres is mostly equivalent to DELETE + INSERT.
This is one of the basic ideas of MVCC (multi-version concurrency control) in Postgres, allowing different processes to see different subsets of the rows depending on when they obtained a snapshot. For example a SELECT started before the DELETE should see the old rows, while a SELECT started after the DELETE commits should see the new versions (at least for the default READ COMMITTED isolation level).
Other MVCC implementations use different approaches, but Postgres creates copies of the rows. The downside of this MVCC implementation is that it leaves behind the deleted rows, even after all the transactions that might see those versions finish.
So at the end you have a row that is “marked” as deleted, but still takes space in the data file. If you have updated a row 100 times, there will be 101 copies of the row. If not cleaned up, those "dead rows" (effectively invisible to any future transaction) will remain in the data files forever, wasting disk space. For tables with a lot of DELETEs and/or UPDATEs, dead rows may easily account for the vast majority of disk space. And of course, those dead rows are still referenced from indexes, increasing the amount of wasted disk space further.
This is what we call "bloat" in PostgreSQL - the tables and indexes are bloated compared to the minimum required amount of space. And naturally, if queries have to process more data (even if 99% of it is immediately thrown away as "invisible"), this impacts performance of those queries.
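To see this in action, here is a minimal sketch (the table name bloat_demo is made up) that deletes half the rows of a small table and then checks the dead-row counter; note that these statistics are updated asynchronously, so the counter may lag behind for a moment:

CREATE TABLE bloat_demo (id int, val text);
INSERT INTO bloat_demo SELECT i, 'x' FROM generate_series(1, 100000) s(i);
DELETE FROM bloat_demo WHERE id <= 50000;

-- the deleted rows are gone from query results, but still occupy space as dead rows
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'bloat_demo';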
VACUUM and autovacuum
The most straightforward way to reclaim the wasted space (occupied by dead rows) is by running the VACUUM command manually. This maintenance command will scan the table and remove dead rows both from the table and indexes. It will not return the disk space back to the operating system / filesystem - it just makes it available for reuse by new rows.
For example, if you have a 10GB table, but 9GB of that is used by dead rows, VACUUM will complete and the table is still going to be 10GB. But then the next 9GB of rows won’t need to extend the table - they’ll use the reclaimed space in the data files. Of course, this example is a bit extreme - you should not allow 9GB of dead rows in the table. We’ll get to that shortly.
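You can verify this behavior yourself - a rough sketch, reusing the hypothetical bloat_demo table from the earlier example: the on-disk size stays roughly the same after VACUUM, even though the dead rows were removed.

SELECT pg_size_pretty(pg_table_size('bloat_demo'));  -- size with the dead rows still in place
VACUUM bloat_demo;
SELECT pg_size_pretty(pg_table_size('bloat_demo'));  -- roughly the same size, the space is reused internally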
Note: VACUUM FULL would reclaim the space and return it to the OS, but it has a number of disadvantages. It acquires an exclusive lock on the table, blocking all operations (including read-only queries), and it creates a fresh copy of the table, possibly doubling the amount of disk space used (so it's particularly impractical when you’re already running out of disk space).
The trouble with running VACUUM manually is that it only happens when you decide to run it, not when it's needed. If you run it every 5 minutes on all tables, it’s likely most of the runs will not actually clean anything. It’ll just check the table, maybe read some of the data, only to realize there’s nothing that could be cleaned yet. So you’ll just be wasting CPU and I/O resources. Or you may choose to run it less frequently, say once a day at night, in which case tables may easily accumulate more dead rows than you'd like.

In other words, it’s very difficult to get the VACUUM schedule right, particularly for workloads that may change over time (due to changes of user activity during the day/week, or due to changes in the application). The one exception is a system that has a very clear schedule for bulk data loads (e.g. an analytical database with nightly batch loads).
This leads us to autovacuum - a background process responsible for triggering cleanup in a timely manner. Often enough to keep the amount of wasted space under control, but not too often. The trick is that the database does track the number of dead rows produced over time (each transaction reports the number of updated/deleted rows), so it can trigger the cleanup when a table accumulates a certain number of them. Which means during busy periods the cleanup will happen more often.
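To see when autovacuum last processed a table (and how many times it has done so), you can check the statistics views, for example:

SELECT relname, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
ORDER BY last_autovacuum DESC NULLS LAST
LIMIT 10;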
autoanalyze
Cleanup of dead rows is not the only responsibility of the autovacuum process. It's also responsible for collecting data distribution statistics, used for query planning. You may collect those manually using ANALYZE, but it suffers from similar issues as VACUUM - it’s hard to run it often enough but not too often. And the solution is also similar - the database can watch the number of modified rows, and run ANALYZE automatically when it crosses some threshold.
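The counter driving this is visible in the statistics views too - a quick sketch showing which tables have accumulated the most modifications since the last ANALYZE:

SELECT relname, n_mod_since_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 10;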
Note: The negative effects are a bit more severe for ANALYZE, because while the cost of VACUUM is mostly proportional to the amount of dead rows (so fairly low when the table has few of them, making unnecessary cleanup cheap), ANALYZE has to actually rebuild the statistics on every execution (which includes collecting a random sample, which may require a considerable amount of I/O). On the other hand, bloat in a table may slow down the queries a bit, because of having to do a bit more I/O. But having stale statistics means a significant risk of choosing inefficient query plans and queries consuming a lot of resources (time, CPU, I/O). The difference may easily be several orders of magnitude.
I'll ignore this autovacuum task in the rest of this blog post - the configuration is fairly similar to the cleanup, and follows roughly the same reasoning.
Monitoring
Before starting to tune something, you need to be able to collect relevant data. Otherwise how would you know if you need to do any tuning at all, or evaluate the impact of the changes? In other words, you need to have some basic monitoring in place, regularly collecting important metrics from the database.
If you’re using some monitoring plugin (and you definitely should, there are plenty of options, both free and commercial), chances are you already have this data. But in this post I’ll demonstrate the tuning using statistics available from the database.
For cleanup, you need to be looking at least at these values:
- pg_stat_all_tables.n_dead_tup - number of dead rows in each table (both user tables and system catalogs)
- (n_dead_tup / n_live_tup) - ratio of dead/live rows in each table
- (pg_class.relpages / pg_class.reltuples) - space "per row"
There's also a handy pgstattuple extension, allowing you to perform analysis on tables and indexes, including computing the amount of free space, dead rows etc.
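For example, a simple query listing the tables with the most dead rows and their dead/live ratio might look like this (just a sketch based on the columns mentioned above):

SELECT relname, n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

With pgstattuple (after CREATE EXTENSION pgstattuple) you can get exact numbers for a single table, e.g. SELECT * FROM pgstattuple('some_table'), though it has to read the whole table to do that.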
Tuning Goals
Before looking at the actual configuration parameters, let's briefly discuss the high-level tuning goals, i.e. what we want to achieve when changing the parameters:
- cleanup dead rows - Keep the amount of wasted disk space reasonably low, prevent table and index bloat, and keep queries fast.
- minimize cleanup impact - Don't perform cleanup too often, as it would waste resources (CPU, I/O and RAM) and might significantly hurt performance.
That is, you need to find the right balance - running the cleanup too often may be just as bad as not running it often enough. The trade-off strongly depends on the amount of data and the type of workload you are dealing with (particularly the number of DELETEs and UPDATEs).
Most configuration options have very conservative defaults. Firstly, the default values were often chosen years ago, based on the resources (CPU, RAM, ...) available at that time. Secondly, we want the default configuration to work everywhere, including tiny machines like Raspberry Pi or small VPS servers. We do adjust the defaults from time to time (as we’ll see later in this blog post), but even then we tend to make small careful steps.
For smaller systems and/or systems handling read-mostly workloads, the default configuration parameters work fine, but for large systems some tuning is needed. As the database size and/or amount of writes increase, problems start to appear.
With autovacuum the typical issue is that the cleanup does not happen often enough, and then when it finally happens it significantly disrupts performance of queries, as it has to deal with a lot of garbage. In those cases you should follow this simple rule:
If it hurts, you're not doing it often enough.
That is, you need to tune the autovacuum configuration so that the cleanup happens more often, but processes a smaller amount of dead rows on every execution.
Now that we know what we want to achieve, let's see the configuration parameters ...
Note: People sometimes follow a different rule - If it hurts, don't do it. - and just disable autovacuum. Please don't do that unless you really (really really) know what you're doing, and have a regularly executed cleanup script (e.g. from cron). Otherwise you're painting yourself into a corner, and instead of somewhat degraded performance you'll eventually have to deal with severely degraded performance or possibly even an outage.
Thresholds and Scale Factors
Naturally, the first thing you may tweak is when the cleanup gets triggered, which is affected by two parameters (with these default values):
autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2
The cleanup is triggered whenever the number of dead rows for a table (which you can see as pg_stat_all_tables.n_dead_tup) exceeds

threshold + pg_class.reltuples * scale_factor

This formula says that up to 20% of a table may be dead rows before it gets cleaned up (the threshold of 50 rows is there to prevent very frequent cleanups of tiny tables, but for large tables it’s dwarfed by the scale factor).
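If you want to see which tables are currently past this trigger point, a query along these lines works (a sketch assuming the default threshold and scale factor, ignoring any per-table overrides):

SELECT s.relname, s.n_dead_tup,
       50 + 0.2 * c.reltuples AS autovacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.n_dead_tup > 50 + 0.2 * c.reltuples
ORDER BY s.n_dead_tup DESC;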
So, what’s wrong with these defaults, particularly with the scale factor? This value essentially determines what fraction of a table can be “wasted”, and 20% works pretty well for small and medium sized tables - on a 10GB table this allows up to 2GB of dead rows. But for a 1TB table this means we can accumulate up to 200GB of dead rows, and then when the cleanup finally happens it will have to do a lot of work at once.
This is an example of accumulating a lot of dead rows, and having to clean them all up at once, which is going to hurt - it’s going to use a lot of I/O and CPU, generate WAL and so on. Which is exactly the kind of disruption of other backends that we’d like to prevent.
Per the rule mentioned earlier, the proper solution is to trigger the cleanup more often. This can be done by significantly decreasing the scale factor, perhaps like this:
autovacuum_vacuum_scale_factor = 0.01
This decreases the limit to only 1% of the table, i.e. ~10GB for the 1TB table. Alternatively you could abandon the scale factor entirely, and rely solely on the threshold:
autovacuum_vacuum_scale_factor = 0
autovacuum_vacuum_threshold = 10000
This would trigger the cleanup after generating 10000 dead rows.
One thing to consider is that just reducing the scale factor makes it easier to trigger cleanup on small tables - if you have a table with 1000 rows, scale factor 1% means it’s enough to update 10 rows for the table to qualify for cleanup. Which seems overly aggressive.
The easiest solution is to just ignore this as a non-problem. Cleanup of small tables is likely very cheap, and the improvement on large tables is usually so significant that even with processing small tables more often, the overall effect is still very positive.
But if you want to prevent this, increase the threshold value a little bit. You might end up with something like this:
autovacuum_vacuum_scale_factor = 0.01
autovacuum_vacuum_threshold = 1000
Which triggers cleanup of a table once the number of dead rows exceeds 1000 + 1% of the table. For small tables the 1000 will dominate the decision, for large tables the 1% scale factor will matter more. For example, a table with 1,000 rows qualifies after ~1,010 dead rows, while a table with 100 million rows qualifies after roughly a million.
Consider also that parameters in postgresql.conf affect the whole cluster (all tables in all databases). If you have only a handful of large tables, you may also consider modifying the parameters only for them using ALTER TABLE:
ALTER TABLE large_table SET (autovacuum_vacuum_scale_factor = 0.01);
ALTER TABLE large_table SET (autovacuum_vacuum_threshold = 10000);
I strongly recommend trying to only modify the postgresql.conf parameters first, and resorting to ALTER TABLE only when that’s not sufficient. Per-table overrides make debugging and analysis of cleanup behavior much more complicated. And if you end up using them, make sure to document the reasoning for the individual tables.
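If you do end up with per-table overrides, it helps to be able to list them - they are stored in pg_class.reloptions, so something like this shows all tables with custom autovacuum settings:

SELECT relname, reloptions
FROM pg_class
WHERE array_to_string(reloptions, ',') LIKE '%autovacuum%';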
Note: How low should you go with these parameters? Why not go to 0.001, i.e. to only 0.1% (1GB) for a 1TB table? You can try, but I don’t recommend such low values. The “wasted” space serves as a buffer for new data, giving the system a bit of slack. Going from 200GB to 10GB saves a lot of space while still leaving a lot of room for new rows, while for 10GB to 1GB the benefit is much smaller. It’s not worth the risk of triggering cleanup prematurely (before the rows can be actually cleaned, forcing the cleanup to be attempted again).
Throttling
An important feature built into the autovacuum system is throttling. The cleanup is meant to be a background maintenance task, with minimum impact on user queries etc. We certainly don’t want the cleanup to consume so many resources (CPU and disk I/O) that it affects regular user activity (e.g. by making queries much slower). That requires limiting the amount of resources the cleanup can utilize over time.
The cleanup process is fairly simple - it reads pages (8kB chunks of data) from data files, and checks if the page needs cleaning up. If there are no dead rows, the page is simply thrown away without any changes. Otherwise it's cleaned up (dead rows are removed), marked as "dirty" and eventually written out (to cache and eventually to disk).
Let’s assign “costs” to the basic cases, expressing the amount of resources needed (the values changed in PG 14, I’ll use the new values unless explicitly mentioned otherwise):
| operation | parameter | PG 14+ | before PG 14 |
|---|---|---|---|
| page found in shared buffers | vacuum_cost_page_hit | 1 | 1 |
| page read from the OS (possibly from disk) | vacuum_cost_page_miss | 2 | 10 |
| page dirtied by the cleanup (has to be written out) | vacuum_cost_page_dirty | 20 | 20 |
This says that if a page is found in shared buffers, that costs 1 token. If it has to be read from the OS (and maybe from disk), it’s considered a bit more expensive and costs 2 tokens. And finally, if the page is dirtied by the cleanup and needs to be written out, it counts as 20. This allows us to account for work done by autovacuum over time.
The throttling is then done by limiting the amount of work that can be done in one go, which is by default set to 200, and every time the cleanup does this much work it'll sleep for a little bit. The sleep used to be 20ms, but in PG 12 it was reduced to 2ms.
autovacuum_vacuum_cost_delay = 2ms
autovacuum_vacuum_cost_limit = 200
Let’s compute how much work that actually allows. The delay changed in PG 12 and the page costs changed in PG 14, which means we have three groups of releases with different parameter values.
With a 2ms delay, the cleanup can run 500 times per second, and with 200 tokens per round that means a budget of 100000 tokens per second (on older releases before PG 12 the limit is 10000). Considering the costs of cleanup operations discussed earlier, that means:
| operation | before PG 12 | PG 12 - 13 | PG 14+ |
|---|---|---|---|
| reading pages from shared buffers | 80 MB/s | 800 MB/s | 800 MB/s |
| reading pages from the OS (possibly from disk) | 8 MB/s | 80 MB/s | 400 MB/s |
| dirtying pages (writing them out) | 4 MB/s | 40 MB/s | 40 MB/s |
Considering the capabilities of current hardware (assuming local storage), the default limits on releases before PG 14 are probably too low. It’s a good idea to either increase the cost limit (perhaps to 1000-2000, increasing the throughput 5-10x), or lower the cost delay in a similar manner. You may of course tweak the other parameters (cost per page operation, sleep delay), but we don’t do that very often - changing the cost limit works well enough.
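These are regular configuration parameters, so the change can also be made with ALTER SYSTEM followed by a configuration reload (no restart needed) - a sketch using one of the values suggested above:

ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
SELECT pg_reload_conf();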
In PG 14 the limits are significantly higher, making the default values more appropriate.
Note: Regular VACUUM has the same throttling mechanism, but it’s disabled by default (vacuum_cost_delay is set to 0). But if you need to run a manual VACUUM, you can enable the throttling to limit the impact on the rest of the database (user queries etc.).
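For example, to throttle a manual VACUUM in the current session using the same values as the autovacuum defaults (the table name is made up):

SET vacuum_cost_delay = 2;   -- milliseconds; 0 by default, i.e. no throttling
SET vacuum_cost_limit = 200;
VACUUM some_large_table;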
Number of Workers
One configuration option not yet mentioned is autovacuum_max_workers, so what's that about? The cleanup is not performed by a single autovacuum process - instead, the database starts up to autovacuum_max_workers processes that do the actual cleanup (each worker processes a single table at a time). The default configuration allows up to 3 such workers.
That's definitely useful, because you don't want the cleanup of small tables to wait until a single large table gets cleaned up (which may take quite a bit of time, due to the throttling).
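On recent releases you can check how many autovacuum workers are active at any moment (and which tables they are processing) in pg_stat_activity:

SELECT pid, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';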
The trouble is users assume increasing the number of workers also increases the amount of cleanup that can happen. If you allow 6 autovacuum workers instead of the default 3, it'll do twice as much cleanup, right?
Unfortunately, no. The cost limit, described a few paragraphs ago, is global - shared by all running autovacuum workers. Each worker process only gets a fraction (roughly 1/autovacuum_max_workers) of the global limit - with the default cost limit of 200 and 3 workers, each worker gets a budget of roughly 67 tokens per round. So increasing the number of workers only makes each of them go slower; it doesn’t increase the overall cleanup throughput.
It's a bit like a highway - doubling the number of cars but making them go half the speed will give you roughly the same number of people reaching the destination per hour.
So if the cleanup on your database can't keep up with user activity, increasing the number of workers is not going to help. You need to tweak the other parameters first.
Per-table Throttling
Actually, when I mentioned that the cost limit is global and shared by all autovacuum workers, that wasn’t entirely correct. Similarly to the scale factor and threshold, the cost limit and delay may be set per table:
ALTER TABLE t SET (autovacuum_vacuum_cost_limit = 1000);
ALTER TABLE t SET (autovacuum_vacuum_cost_delay = 10);
Workers processing such tables are however not included in the global cost balancing, and instead are throttled independently (i.e. the limit applies just to that single worker).
In practice, we almost never use this feature, and we recommend not using it. It makes the cleanup behavior much harder to predict and reason about - having multiple workers that are sometimes throttled together and sometimes independently makes this very complex. You probably want to use a single global limit on the background cleanup.
Summary
If I had to sum it up in a few basic rules, it'd be these five:
- Don't disable autovacuum, unless you really know what you're doing. Seriously.
- On busy databases (doing a lot of UPDATEs and DELETEs), particularly large ones, you should probably decrease the scale factor, so that cleanup happens more frequently.
- On reasonable hardware (good storage, multiple cores), you may need to increase the throttling parameters, so that the cleanup can keep up. This applies especially to older releases, before Postgres 14.
- Increasing autovacuum_max_workers alone will not help in most cases, as it does not increase cleanup throughput. You'll get more processes that go slower.
- You can set the parameters per table using ALTER TABLE, but think twice if you really need that. It makes the system more complex and more difficult to inspect.
I originally included a few sections explaining cases when autovacuum does not really work, and how to detect them (and what is the best solution), but the blog post is already too long so I'll leave that for a separate blog post.