A few weeks ago I explained the basics of autovacuum tuning. At the end of that post I promised to look into common vacuuming problems soon. Well, it took a bit longer than I planned, but here we go.
To quickly recap, autovacuum is a background process cleaning up dead rows, e.g. old deleted row versions. You can also perform the cleanup manually by running VACUUM, but autovacuum does that automatically depending on the number of dead rows in the table, at the right moment – not too often but frequently enough to keep the amount of “garbage” under control.
Generally speaking, autovacuum can’t run too often – the cleanup is only performed after some number of dead rows accumulates in the table. But it may be delayed for various reasons, resulting in tables and indexes getting larger than desirable. And that’s exactly the topic of this post. So what are the common culprits, and how do you identify them?
Throttling
As explained in the tuning basics, autovacuum workers are throttled to only perform a certain amount of work per time interval. The default limits are fairly low – about 4MB/s of writes, 8MB/s of reads. That is suitable for tiny machines like a Raspberry Pi or small servers from 10 years ago, but current machines are way more powerful (both in terms of CPU and I/O) and handle much more data.
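For the record, those numbers roughly follow from the default cost-based parameters of that era (vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 20ms, vacuum_cost_page_miss = 10, vacuum_cost_page_dirty = 20, 8kB pages) – a quick back-of-the-envelope calculation:
200 cost units per 20 ms sleep              = 10000 cost units per second
reads:  10000 / vacuum_cost_page_miss (10)  = 1000 pages/s ≈ 8 MB/s
writes: 10000 / vacuum_cost_page_dirty (20) =  500 pages/s ≈ 4 MB/s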
Imagine you have a few large tables and some small ones. If all three autovacuum workers start cleaning up the large tables, none of the small tables will get vacuumed regardless of the number of dead rows they accumulate. Identifying this is not particularly difficult, assuming you have sufficient monitoring: look for periods when all autovacuum workers are busy while tables are not vacuumed despite accumulating many dead rows.
All the necessary information is in pg_stat_activity (number of autovacuum worker processes) and pg_stat_all_tables (last_autovacuum and n_dead_tup).
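A minimal sketch of both checks, assuming the workers are identified simply by their query text in pg_stat_activity (a heuristic, but usually good enough):
-- how many autovacuum workers are currently running
SELECT count(*) FROM pg_stat_activity
 WHERE query LIKE 'autovacuum:%';
-- tables accumulating dead rows, with the last cleanup timestamps
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_all_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;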
Increasing the number of autovacuum workers is not a solution, as the total amount of work remains the same. You can specify per-table throttling limits, excluding that worker from the total limit, but that still does not guarantee there will be available workers when needed.
The right solution is tuning the throttling, using limits reasonable with respect to the hardware configuration and workload patterns. Some basic throttling recommendations are mentioned in the previous post. (Obviously, if you can reduce the amount of dead rows generated in the database, that would be an ideal solution.)
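What that might look like in practice – a sketch only, with purely illustrative values that need to be chosen based on your hardware and workload:
-- illustrative values, not a recommendation; takes effect after a reload
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms';
SELECT pg_reload_conf();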
From this point we’ll assume the throttling is not the issue, i.e. that the autovacuum workers are not saturated for long periods of time, and that the cleanup is triggered on all tables without unreasonable delays.
Long transactions
So, if the table is vacuumed regularly, surely it can’t accumulate a lot of dead rows, right? Unfortunately, no. The rows are not actually “removable” immediately after getting deleted, but only when there are no transactions that might possibly see them. The exact behavior depends on what the other transactions are (were) doing and on the isolation level, but in general:
READ COMMITTED
- running queries block cleanup
- idle transactions block cleanup only if they performed a write
- idle transactions (without any writes) won’t block cleanup (but it’s not a good practice to keep them around anyway)
SERIALIZABLE
- running queries block cleanup
- idle transactions block cleanup (even if they only did reads)
In practice it’s more nuanced of course, but explaining all the various bits would require first explaining how XIDs and snapshots work, and that’s not the goal of this post. What you really should take away from this is that long transactions are a bad idea, particularly if those transactions might have done writes.
Of course, there are perfectly valid reasons why you may need to keep transactions for long periods of time (e.g. if you need to ensure ACID for all the changes). But make sure it does not happen unnecessarily, e.g. due to a poor application design.
A somewhat unexpected consequence of this is high CPU and I/O usage, due to autovacuum running over and over without cleaning any dead rows (or just a few of them). Because of that, the tables remain eligible for cleanup on the next round, causing more harm than good.
How to detect this? Firstly, you need to monitor long-running transactions, particularly idle ones. All you need to do is read data from pg_stat_activity. The view definition changes a bit between PostgreSQL versions, so you may need to tweak these queries a bit:
SELECT xact_start, state FROM pg_stat_activity;
-- count 'idle' transactions longer than 15 minutes (since BEGIN)
SELECT COUNT(*) FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND (now() - xact_start) > interval '15 minutes';
-- count transactions 'idle' for more than 5 minutes
SELECT COUNT(*) FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND (now() - state_change) > interval '5 minutes';
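On PostgreSQL 9.4 and later, pg_stat_activity also exposes backend_xmin, which makes it easy to see which backends hold back the cleanup horizon the most – a simple sketch:
-- backends holding back the xmin horizon (and thus the cleanup)
SELECT pid, datname, state, xact_start, age(backend_xmin) AS xmin_age
  FROM pg_stat_activity
 WHERE backend_xmin IS NOT NULL
 ORDER BY age(backend_xmin) DESC;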
You may also simply use some existing monitoring plugin, e.g. check_postgres.pl; those already include this type of sanity check. You’ll have to decide what a reasonable transaction/query duration is, which is application-specific.
Since PostgreSQL 9.6 you can also use idle_in_transaction_session_timeout so that transactions idle for too long are terminated automatically. Similarly, for long queries there’s statement_timeout.
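For example (the values are illustrative, and the role name is made up – setting statement_timeout globally is rather aggressive, so a per-role setting may be a better fit):
-- terminate sessions 'idle in transaction' for more than 15 minutes (9.6+)
ALTER SYSTEM SET idle_in_transaction_session_timeout = '15min';
SELECT pg_reload_conf();
-- cancel long queries for a (hypothetical) reporting role, applied to new sessions
ALTER ROLE reporting SET statement_timeout = '1h';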
Another useful thing is VACUUM VERBOSE, which will actually tell you how many dead rows could not be removed yet:
db=# VACUUM verbose z;
INFO: vacuuming "public.z"
INFO: "z": found 0 removable, 66797 nonremovable row versions in 443 out of 443 pages
DETAIL: 12308 dead row versions cannot be removed yet.
...
It will not tell you which backend is preventing the cleanup, but it’s a pretty clear sign of what’s happening.
Note: You can’t easily get this information from autovacuum because it’s only logged with DEBUG2 by default (and you surely don’t want to run with that log level in production).
Long queries on hot standbys
Let’s assume tables are being vacuumed in a timely manner, yet the dead tuples are not getting removed, resulting in table and index bloat. You are monitoring pg_stat_activity and there are no long-running transactions. What could be the issue?
If you have a streaming replica, chances are the issue is there. If the replica uses hot_standby_feedback=on, queries on the replica act pretty much like transactions on the primary, including blocking the cleanup. Of course, hot_standby_feedback=on is used exactly when running long queries (e.g. analytics and BI workloads) on replicas, to prevent cancellations due to replication conflicts.
Unfortunately, you’ll have to choose – either keep hot_standby_feedback=on and accept delays in cleanup, or deal with canceled queries. You might also use max_standby_streaming_delay to limit the impact, although that does not prevent the cancellations entirely (so you still need to retry the queries).
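For completeness, both knobs are set on the standby, roughly like this (illustrative values):
-- on the standby: either report the oldest snapshot back to the primary ...
ALTER SYSTEM SET hot_standby_feedback = on;
-- ... or allow queries to delay WAL replay for a while before getting canceled
ALTER SYSTEM SET max_standby_streaming_delay = '30min';
SELECT pg_reload_conf();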
Actually, there’s a third option now – logical replication. Instead of using physical streaming replication for the BI replica, you can copy the changes using the new logical replication, available in PostgreSQL 10. Logical replication relaxes the coupling between the primary and the replica, and makes the clusters mostly independent (e.g. they are cleaned up independently). This solves the two issues associated with physical streaming replication – delayed cleanup on the primary or canceled queries on the BI replica. For replicas serving DR purposes, streaming replication remains the right choice, though. But those replicas are not (or should not be) running long queries.
Note: While I mentioned that logical replication will be available in PostgreSQL 10, a significant portion of the infrastructure was available in previous releases (particularly PostgreSQL 9.6). So you might be able to do this even on older releases (we did that for some of our customers), but PostgreSQL 10 will make it much more convenient and comfortable.
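In PostgreSQL 10 the basic setup is just two commands – a minimal sketch, with made-up table names and connection string:
-- on the primary: publish the tables the BI replica needs
CREATE PUBLICATION bi_pub FOR TABLE orders, customers;
-- on the BI replica: subscribe to that publication
CREATE SUBSCRIPTION bi_sub
  CONNECTION 'host=primary.example.com dbname=mydb user=repluser'
  PUBLICATION bi_pub;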
Trouble with autoanalyze
A detail you might miss is that autovacuum workers actually perform two different tasks: firstly the cleanup (as if running VACUUM), but also the collection of statistics (as if running ANALYZE). And both parts are throttled using autovacuum_cost_limit.
But there’s a big difference in how the two parts handle transactions. Whenever the VACUUM part reaches autovacuum_cost_limit, the worker releases the snapshot and sleeps for a while. The ANALYZE part, however, has to run in a single snapshot/transaction, which does block cleanup.
This is an elegant way to shoot yourself in the foot, particularly if you also do some of this:
- increase default_statistics_target to build more accurate statistics from larger samples
- lower autovacuum_analyze_scale_factor to collect statistics more frequently
The unintended consequence, of course, is that ANALYZE will happen more frequently, will take much longer, and will (unlike the VACUUM part) block the cleanup. The solution is usually fairly simple – don’t lower autovacuum_analyze_scale_factor too much. Running ANALYZE every time 10% of the table changes should be more than enough in most cases.
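And if one particular table genuinely needs more accurate or more frequent statistics, it’s usually better to tweak just that table instead of the global defaults – a sketch with hypothetical table/column names:
-- per-table override instead of lowering the global autovacuum_analyze_scale_factor
ALTER TABLE big_events SET (autovacuum_analyze_scale_factor = 0.05);
-- per-column statistics target instead of raising default_statistics_target globally
ALTER TABLE big_events ALTER COLUMN event_type SET STATISTICS 500;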
n_dead_tup
One last thing I’d like to mention is about changes in pg_stat_all_tables.n_dead_tup values. You might think that the value is a simple counter, incremented whenever a new dead tuple is created and decremented whenever it’s cleaned up, but it’s actually only an estimate of the number of dead tuples, updated by ANALYZE. For small tables (less than 240MB) it’s not really a big difference, because ANALYZE reads the whole table and the estimate is therefore pretty exact. For large tables it may, however, change quite a bit depending on which subset of the table gets sampled. And lowering autovacuum_vacuum_scale_factor makes it more random.
So be careful when looking at n_dead_tup in a monitoring system. Sudden drops or increases in the value may be simply due to ANALYZE recomputing a different estimate, and not due to actual cleanup and/or new dead tuples appearing in the table.
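When a chart looks suspicious, comparing the estimate with the timestamps of the last (auto)analyze runs usually tells the two cases apart – a simple sketch:
-- dead tuple estimates together with the last statistics runs
SELECT relname, n_dead_tup, last_analyze, last_autoanalyze
  FROM pg_stat_all_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;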
Summary
To summarize this into a few simple points:
- autovacuum can only do its work if there are no transactions that might need the dead tuples.
- Long-running queries do block cleanup. Consider using statement_timeout to limit the damage.
- Long-running transactions may block cleanup. The exact behavior depends on things like the isolation level or what happened in the transaction. Monitor them and terminate them if possible.
- Long-running queries on replicas with hot_standby_feedback=on may also block cleanup.
- autoanalyze is throttled too, but unlike the VACUUM part it keeps a single snapshot (and thus blocks cleanup).
- n_dead_tup is just an estimate maintained by ANALYZE, so expect some fluctuation (especially on large tables).