Back to the Future Pt. 1: Introduction to pg_rewind

September 12, 2016


Since PostgreSQL 9.5, pg_rewind has been able to make a former master follow a promoted standby even though, in the meantime, it proceeded on its own timeline. Consider, for instance, the case of a switchover that didn’t work properly.

Have you ever experienced a "split brain" during a switchover operation? You know, when the goal is to switch the roles of the master and the standby, but instead you end up with two independent masters, each one with its own timeline? For PostgreSQL DBAs in HA contexts, this is where pg_rewind comes in handy!

Before PostgreSQL 9.5, there was only one solution to this problem: re-synchronise the PGDATA of the downgraded master with a new base backup and add it back to the HA cluster as a new standby node. Generally this is not a problem, unless your database is relatively big. With hundreds of GBs, for instance, it is not easy to do this type of operation while trying to keep downtime as low as possible.

Restoring a database to a previous point in time can create some complexities that can be handled in different ways. For an in-depth explanation of the evolution and components of fault tolerance in PostgreSQL, I suggest you check out Gulcin’s series Evolution of Fault Tolerance in PostgreSQL, including the Time Travel post that mentions the use of pg_rewind.

How pg_rewind works

pg_rewind scans the "old" master’s PGDATA folder, identifies the data blocks changed during the switch to the new timeline, and then copies only those blocks from the promoted standby. As a "collateral" effect, the configuration files are also copied from the promoted standby (so the DBA has to be careful to adapt them to the new role of the node in the HA cluster). The advantage is that this avoids having to re-sync PGDATA completely.

To do this, it is necessary to have all the WALs produced by the old master in the final moments before the switchover. Changes are identified by comparing the state of the data blocks present in the PGDATA with the changes logged in the WALs. Once the changed blocks are identified, the WALs are replayed, mimicking a sort of "rewind" of the timelines.

Moreover, pg_rewind requires that at least one of the following holds for the instances (the example below enables both):

  • they have been initialised with the “-k” (or --data-checksums) parameter
  • the parameter wal_log_hints is enabled
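A quick way to verify both prerequisites on an existing instance is shown in this sketch (assuming PGDATA points to its data directory):

# "Data page checksum version" is non-zero if initdb was run with -k
pg_controldata ${PGDATA} | grep -i checksum

# wal_log_hints must be "on" (changing it requires a restart)
psql -c "SHOW wal_log_hints"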

In PostgreSQL 9.5, the necessary WALs are the ones starting from the last checkpoint, since pg_rewind cannot go back before that point in the timeline.
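You can see where that checkpoint is with pg_controldata (again a sketch, assuming PGDATA points to the data directory):

# Show the latest (and prior) checkpoint location of the instance
pg_controldata ${PGDATA} | grep -i "checkpoint location"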

To better understand how it works, consider this simple example with a master:

# Set PATH variable
export PATH=/usr/pgsql-9.5/bin:${PATH}

# This is the directory we will be working in
# Feel free to change it and the rest of the script
# will adapt itself
WORKDIR=/var/lib/pgsql/9.5

# Environment variables for PGDATA and archive directories
MASTER_PGDATA=${WORKDIR}/master
STANDBY1_PGDATA=${WORKDIR}/standby1
ARCHIVE_DIR=${WORKDIR}/archive

# Initialise the cluster
initdb --data-checksums -D ${MASTER_PGDATA}

# Basic configuration of PostgreSQL
cat >> ${MASTER_PGDATA}/postgresql.conf <<EOF
archive_command = 'cp %p ${ARCHIVE_DIR}/%f'
archive_mode = on
wal_level = hot_standby
max_wal_senders = 10
min_wal_size = '32MB'
max_wal_size = '32MB'
hot_standby = on
wal_log_hints = on
EOF

cat >> ${MASTER_PGDATA}/pg_hba.conf <<EOF
# Trust local access for replication
# BE CAREFUL WHEN DOING THIS IN PRODUCTION
local replication replication trust
EOF

# Create the archive directory
mkdir -p ${ARCHIVE_DIR}

# Start the master
pg_ctl -D ${MASTER_PGDATA} -l ${WORKDIR}/master.log start

# Create the replication user
psql -c "CREATE USER replication WITH replication"

(note the small amount of WAL kept in the master), and then a standby:

# Create the first standby
pg_basebackup -D ${STANDBY1_PGDATA} -R -c fast -U replication -x

cat >> ${STANDBY1_PGDATA}/postgresql.conf <<EOF
port = 5433
EOF

# Start the first standby
pg_ctl -D ${STANDBY1_PGDATA} -l ${WORKDIR}/standby1.log start

Let’s insert some data on the master. You will see it from the (hot) standby too.
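For instance, a minimal sketch (the fruits table is just an illustration, not part of the original setup):

# Create a table on the master and add some rows
psql -c "CREATE TABLE fruits(id serial PRIMARY KEY, name text)"
psql -c "INSERT INTO fruits(name) VALUES ('apple'), ('banana')"

# The same rows are visible from the hot standby, listening on port 5433
psql -p 5433 -c "SELECT * FROM fruits"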

Now promote the standby, leaving the master as it is:

pg_ctl -D ${STANDBY1_PGDATA} promote

Now, if you update the master, no changes will be visible from the standby. Moreover, in the archive/ directory you can see the file 00000002.history, which shows that the timeline changed during the promotion.
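For example, continuing with the hypothetical fruits table from the sketch above:

# This change lands on timeline 1 only: the promoted standby will not see it
psql -c "INSERT INTO fruits(name) VALUES ('cherry')"

# The promotion has archived a history file for the new timeline
ls ${ARCHIVE_DIR}/00000002.history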

Now let’s "rewind" the old master and make it follow the promoted standby:

~$ pg_ctl -D ${MASTER_PGDATA} stop
waiting for server to shut down.... done
server stopped
~$ pg_rewind --target-pgdata=${MASTER_PGDATA} \
    --source-server="port=5433 user=postgres dbname=postgres"

Note here that for the connection to the source server, the promoted standby, we used the postgres user, since a superuser is needed by pg_rewind to inspect the data blocks.

If the max_wal_size parameter is not large enough to keep the needed WALs in the pg_xlog/ directory of the old master, as I deliberately arranged above, you can get an error similar to the following:

The servers diverged at WAL position 0/3015938 on timeline 1.
could not open file "/var/lib/pgsql/9.5/master/pg_xlog/000000010000000000000002": No such file or directory
could not find previous WAL record at 0/3015938
Failure, exiting

Now, there are two possible ways to solve this:

  • manually retrieve the missing WALs from the archive, starting from the one listed in the error message, and copy them into the pg_xlog/ directory of the old master (see the sketch just below)
  • add a proper restore_command to a recovery.conf placed in the PGDATA of the old master, so pg_rewind will automatically find the missing WALs
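For the first option, with the archive directory used in this example, the manual copy could look like the following sketch (repeat for every segment pg_rewind complains about):

# Copy the timeline-1 segment named in the error message back from
# the archive into the old master's pg_xlog/ directory
cp ${ARCHIVE_DIR}/000000010000000000000002 ${MASTER_PGDATA}/pg_xlog/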

The second option is probably the more suitable method. Think, for instance, of a WAL archive managed by Barman: you could base the restore_command on the get-wal feature of Barman, as explained in this interesting article by Gabriele. Doing so, Barman acts as a WAL hub, providing all the necessary WAL files to pg_rewind.
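As a sketch, assuming a Barman server reachable via SSH as barman@backup with a server configuration named pg (both names are placeholders), the recovery.conf could contain:

# Fetch any missing WAL file on demand from the Barman archive
restore_command = 'ssh barman@backup "barman get-wal pg %f" > %p'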

Once the missing WALs are retrieved, you can run the pg_rewind command again, and the following message should confirm that everything worked properly:

~$ pg_rewind --target-pgdata=${MASTER_PGDATA} \
    --source-server="port=5433 user=postgres dbname=postgres"
servers diverged at WAL position 0/3015938 on timeline 1
rewinding from last common checkpoint at 0/3000140 on timeline 1
Done!

Keep in mind that just a few blocks were copied (those changed during the split brain), even if your database size is hundreds of GBs! Remember that the configuration files are also copied and that any recovery.conf already present in the "downgraded" master has been overwritten. So, remember to:

  • change the port the instance listens on in the postgresql.conf (set it back to 5432 in our example)
  • change the primary_conninfo in the recovery.conf so that the downgraded master connects to the new master (the promoted standby)

Once this has been done, start the downgraded master: it will catch up with the promoted standby and then follow it as its standby in turn.
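A minimal sketch of these final adjustments, based on this example's layout (the ports and the replication user come from the scripts above):

# The configuration was copied from the standby: restore the original port
sed -i "s/^port = 5433$/port = 5432/" ${MASTER_PGDATA}/postgresql.conf

# Write a fresh recovery.conf pointing at the new master (the promoted standby)
cat > ${MASTER_PGDATA}/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'port=5433 user=replication'
EOF

# Start the rewound node: it will now follow the promoted standby
pg_ctl -D ${MASTER_PGDATA} -l ${WORKDIR}/master.log start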

Do you have a more complex HA cluster? Don’t worry! Part 2 will explain this more in depth and talk about pg_rewind in PostgreSQL 9.6!
