Back to the Future Part 3: pg_rewind with PostgreSQL 9.6

October 17, 2016


This is the third and last part of the series of blog articles dedicated to pg_rewind. In the two previous articles we have seen how pg_rewind is useful to fix split-brain events caused by mistakes in the switchover procedure, avoiding the need for new base backups. We have also seen that this holds only for simple replication clusters: where more standby nodes are involved, just two of the nodes can be fixed, and the other ones need a new base backup to be re-synchronised. With PostgreSQL 9.6, pg_rewind is now able to work with complex replication clusters.

Indeed, pg_rewind has been extended so that it can view the timeline history graph of an entire HA cluster, like the one mentioned in my previous blog article. It is able to find the most recent shared point in the timeline history between a target and a source node, and it then performs the rewind operation from this point – not simply from the last checkpoint of the master before the standby promotion, as in PostgreSQL 9.5.

So, let’s consider again the same cluster, but now based on PostgreSQL 9.6:

~$ # Set PATH variable
~$ export PATH=/usr/pgsql-9.6/bin:${PATH}
~$ 
~$ # This is the directory where we will be working on
~$ # Feel free to change it and the rest of the script
~$ # will adapt itself
~$ WORKDIR=/var/lib/pgsql/9.6
~$ 
~$ # Environment variables for PGDATA and archive directories
~$ MASTER_PGDATA=${WORKDIR}/master
~$ STANDBY1_PGDATA=${WORKDIR}/standby1
~$ STANDBY2_PGDATA=${WORKDIR}/standby2
~$ ARCHIVE_DIR=${WORKDIR}/archive
~$ 
~$ # Create the archive directory
~$ mkdir -p ${ARCHIVE_DIR}
~$ 
~$ # Create the HA cluster
~$ initdb --data-checksums -D ${MASTER_PGDATA}
~$ cat >> ${MASTER_PGDATA}/postgresql.conf <<EOF
~$ archive_command = 'cp %p ${ARCHIVE_DIR}/%f'
~$ archive_mode = on
~$ wal_level = hot_standby
~$ max_wal_senders = 10
~$ min_wal_size = '32MB'
~$ max_wal_size = '32MB'
~$ hot_standby = on
~$ wal_log_hints = on
~$ EOF
~$ cat >> ${MASTER_PGDATA}/pg_hba.conf <<EOF
~$ # Trust local access for replication
~$ # BE CAREFUL WHEN DOING THIS IN PRODUCTION
~$ local replication replication trust
~$ EOF
~$ pg_ctl -D ${MASTER_PGDATA} -l ${WORKDIR}/master.log start
~$ psql -c "CREATE USER replication WITH replication"
~$ pg_basebackup -D ${STANDBY1_PGDATA} -R -c fast -U replication -x
~$ echo "port = 5433" >> ${STANDBY1_PGDATA}/postgresql.conf
~$ pg_ctl -D ${STANDBY1_PGDATA} -l ${WORKDIR}/standby1.log start
~$ pg_basebackup -D ${STANDBY2_PGDATA} -R -c fast -U replication -x
~$ echo "port = 5434" >> ${STANDBY2_PGDATA}/postgresql.conf
~$ pg_ctl -D ${STANDBY2_PGDATA} -l ${WORKDIR}/standby2.log start
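
At this point the master should be streaming to both standbys. We can quickly verify it (a minimal sketch, assuming the master listens on the default port 5432):

~$ # Both standbys should be listed here, in "streaming" state
~$ psql -x -c "SELECT application_name, state, sync_state FROM pg_stat_replication"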

Let’s simulate a wrong switchover, where one of the two standby servers is promoted as the new master, while the old master and the other standby remain an independent HA cluster:

~$ pg_ctl -D ${STANDBY1_PGDATA} promote

Here, the promoted standby moves to timeline 2, while the other two nodes remain on timeline 1.
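
The fork point is recorded in the timeline history file created at promotion (a minimal sketch; the file name derives from the new timeline being 2, and the exact WAL position depends on your run):

~$ # Each line holds: parent timeline, WAL position of the fork, reason
~$ cat ${STANDBY1_PGDATA}/pg_xlog/00000002.history
1	0/A0002C8	no recovery target specified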

Now let’s complete the “split-brain” by creating a new table on the old master: it will be replicated to its remaining standby, but it will not be visible on the promoted standby.
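
For example (a minimal sketch; the table name t_split_brain is hypothetical, and the old master is assumed to still listen on the default port 5432):

~$ # Created on the old master: it reaches standby2 (port 5434),
~$ # but not the promoted standby1 (port 5433)
~$ psql -p 5432 -c "CREATE TABLE t_split_brain AS SELECT 1 AS i"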

Now the goal is to recreate the original HA cluster, with a master having the same content it had before the split-brain (i.e. without the last created table), and the two standbys.

So, since with PostgreSQL 9.6 pg_rewind is able to re-synchronise every node of an HA cluster with any other node, the idea is to:

  1. Stop the other standby
  2. Stop the old master, and resync it with the promoted standby through pg_rewind
  3. Change its port and primary_conninfo in the configuration, so that it follows the promoted standby (a sketch of these changes is shown below)
  4. Resync the other standby with the promoted standby through pg_rewind
  5. Change its port and primary_conninfo in the configuration, so that it too follows the promoted standby

Let’s see the results:

~$ pg_ctl -D ${STANDBY2_PGDATA} stop
waiting for server to shut down.... done
server stopped
~$ pg_ctl -D ${MASTER_PGDATA} stop
waiting for server to shut down.... done
server stopped
~$ pg_rewind --target-pgdata=${MASTER_PGDATA} --source-server="port=5433 user=postgres dbname=postgres"
servers diverged at WAL position 0/A0002C8 on timeline 1
rewinding from last common checkpoint at 0/A000220 on timeline 1
Done!
~$ pg_rewind --target-pgdata=${STANDBY2_PGDATA} --source-server="port=5433 user=postgres dbname=postgres"
servers diverged at WAL position 0/A0002C8 on timeline 1
rewinding from last common checkpoint at 0/A000220 on timeline 1
Done!

Now change the configurations of both the old master and the other standby, so that they follow the promoted standby.
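
For example, here is what is needed for the old master (a minimal sketch: pg_rewind also copies the source's configuration files over the target, so the port copied from standby1 has to be changed back; the recovery.conf settings are the usual ones for a 9.6 standby):

~$ # Restore the original port (the configuration copied from standby1 carries port 5433)
~$ sed -i 's/^port = 5433$/#port = 5433/' ${MASTER_PGDATA}/postgresql.conf
~$ # Follow the promoted standby, which listens on port 5433
~$ cat > ${MASTER_PGDATA}/recovery.conf <<EOF
~$ standby_mode = 'on'
~$ primary_conninfo = 'port=5433 user=replication'
~$ recovery_target_timeline = 'latest'
~$ EOF

The same applies to the other standby, restoring port = 5434 in its postgresql.conf. Once done, start the two nodes: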

~$ pg_ctl -D ${MASTER_PGDATA} start
~$ pg_ctl -D ${STANDBY2_PGDATA} start

Now the three nodes are up, and:

  • There’s a master with two standbys, as in the beginning
  • The table created after the split-brain is not visible, so the data content is the same as in the beginning
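
A possible verification (a sketch, re-using the hypothetical table name from above):

~$ # Expect: ERROR:  relation "t_split_brain" does not exist
~$ psql -p 5432 -c "SELECT * FROM t_split_brain"
~$ # Expect one row per standby, both in "streaming" state
~$ psql -p 5433 -c "SELECT application_name, state FROM pg_stat_replication"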

Cool!! Just think about the alternative of re-synchronising two nodes with full base backups… :S

Conclusions

pg_rewind is one of the most useful features for working with physical replication. It allows you to avoid re-synchronisation through a full base backup in the case of an accidental split-brain.

PostgreSQL 9.6 adds a new, powerful feature to pg_rewind: in the case of HA clusters with a complex structure, it is able to recreate the original state of the cluster starting from any of its nodes.

As a final recommendation: take care of WAL archiving! Generally, with physical replication, DBAs base their HA clusters just on streaming connections; to use pg_rewind you must have WAL archiving configured, so you have a place from which to retrieve the needed WALs in case they are no longer available on the master. I recommend considering the use of Barman, with its get-wal feature, to retrieve any WALs you might need.
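
A commonly used pattern for this (a sketch only: it assumes a Barman host reachable over SSH as barman@backup, where this PostgreSQL server is configured under the name pg96) is a restore_command like the following in the recovery.conf of each node:

# Fetch WAL files that are no longer on the master from the Barman archive
restore_command = 'ssh barman@backup "barman get-wal pg96 %f" > %p'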
