repmgr (Replication Manager) is an open source tool used for managing the replication and failover of PostgreSQL clusters. In this post we will learn to set up and configure the cluster for automatic failover.
Prerequisites
The following software must be installed on both master and standby servers:
- PostgreSQL
- repmgr (matching the installed PostgreSQL major version)
- At the network level, connections with the PostgreSQL port (default: 5432) must be possible in both directions.
Step 1: Install PostgreSQL
Create two clusters/servers with the PostgreSQL installation. You can follow the PostgreSQL instructions at the link below for installation using PostgreSQL’s PGDG repo package. For the sake of naming conventions, we will consider master and standby as two servers.
https://wiki.postgresql.org/wiki/YUM_Installation
yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
yum -y install epel-release yum-utils
yum-config-manager --enable pgdg12
yum install postgresql12-server postgresql12
/usr/pgsql-12/bin/postgresql-12-setup initdb
Note: The above step of initialization of the cluster is not needed on the standby server.
systemctl enable --now postgresql-12
systemctl status postgresql-12
Step 2: Install repmgr
You will need to install repmgr on the master as well as standby.
yum -y install repmgr12*
Step 3: Configure PostgreSQL
On the primary server, a PostgreSQL instance must be initialized and running. The following replication settings may need to be adjusted:
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'hot_standby' or 'replica' or 'logical'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'
Step 4: Create users
Create a dedicated PostgreSQL superuser account and a database for the repmgr metadata:
create user repmgr;
create database repmgr with owner repmgr;
Step 5: Configure pg_hba.conf
Ensure the repmgr user has appropriate permissions in pg_hba.conf and can connect in replication mode; pg_hba.conf should contain entries similar to the following:
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 192.168.1.0/24 trust
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 192.168.1.0/24 trust
Note: Adjust above settings according to your network configurations.
Step 6: Configure the repmgr file
Create a repmgr.conf on the master server with the following entries:
cluster='failovertest'
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data/'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
Note: The location and file must be accessible for the user we are using for repmgr.
Step 7: Register the primary server
Register the primary server with repmgr:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register
WARNING: the following problems were found in the configuration file:
parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
-bash-4.2$
Then check the status of the cluster:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
WARNING: the following problems were found in the configuration file:
parameter "cluster" is deprecated and will be ignored
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
Step 8: Build/clone the standby server
Create the repmgr.conf file on standby server:
-bash-4.2$ cat repmgr.conf
node_id=2
node_name=node2
conninfo='host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
Note: In the above commands, for the host IP info we need to specify the IP of the standby server. In this example, 172.16.140.137 is my standby server.
We can now perform the dry run and test if our configuration is correct:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met
If there is no problem, start cloning:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/var/lib/pgsql/12/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
pg_basebackup -l "repmgr base backup" -D /var/lib/pgsql/12/data -h 172.16.140.135 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"
Step 9: Register the standby server
Register the standby server with repmgr:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register
INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered
-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
1 | node1 | primary | * running | | default | 100 | 1 | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
2 | node2 | standby | running | node1 | default | 100 | 1 | host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2
-bash-4.2$
Step 10: Start repmgrd daemon process
To enable the automatic failover, we now need to start the repmgrd daemon process on master slave and witness:
For example:
-bash-4.2$ /usr/pgsql-12/bin/repmgrd -f /var/lib/pgsql/repmgr.conf
[2020-02-23 20:44:43] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2020-02-23 20:44:43] [INFO] connecting to database "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
-bash-4.2$ INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2020-02-23 20:44:43] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] "connection_check_type" set to "ping"
[2020-02-23 20:44:43] [NOTICE] monitoring cluster primary "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] child node "node2" (ID: 2) is attached
We can also check the events for the cluster:
-bash-4.2$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event
Node ID | Name | Event | OK | Timestamp | Details
---------+-------+--------------------+----+---------------------+----------------------------------------------------------------------------------------
2 | node2 | repmgrd_start | t | 2020-02-23 20:46:26 | monitoring connection to upstream node "node1" (ID: 1)
1 | node1 | repmgrd_start | t | 2020-02-23 20:44:43 | monitoring cluster primary "node1" (ID: 1)
2 | node2 | standby_register | t | 2020-02-23 20:39:24 | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)
1 | node1 | primary_register | t | 2020-02-23 20:39:11 | existing primary record updated
2 | node2 | standby_register | t | 2020-02-23 20:38:25 | standby registration succeeded; upstream node ID is 1
2 | node2 | standby_unregister | t | 2020-02-23 20:37:56 |
2 | node2 | standby_register | t | 2020-02-23 20:12:23 | standby registration succeeded; upstream node ID is 1
2 | node2 | standby_clone | t | 2020-02-23 20:09:25 | cloned from host "172.16.140.135", port 5432; backup method: pg_basebackup; --force: N
1 | node1 | primary_register | t | 2020-02-23 19:57:11 |
1 | node1 | cluster_created | t | 2020-02-23 19:57:11 |
Now, if the master server fails, repmgrd will detect that the master is not reachable and then promote the next available server and perform the automatic failover.
The log messages will be as follows:
-bash-4.2$ [2020-02-23 20:51:28] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2020-02-23 20:52:40] [WARNING] unable to ping "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-23 20:52:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
.
.
.
[2020-02-23 20:53:30] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-23 20:53:30] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.140.135 fallback_application_name=repmgr"
[2020-02-23 20:53:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-23 20:53:30] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-23 20:53:30] [INFO] 0 active sibling nodes registered
[2020-02-23 20:53:30] [INFO] primary and this node have the same location ("default")
[2020-02-23 20:53:30] [INFO] no other sibling nodes - we win by default
[2020-02-23 20:53:30] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-23 20:53:30] [INFO] promote_command is:
"/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file"
[2020-02-23 20:53:30] [NOTICE] promoting standby to primary
[2020-02-23 20:53:30] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2020-02-23 20:53:30] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-23 20:53:31] [NOTICE] STANDBY PROMOTE successful
[2020-02-23 20:53:31] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2020-02-23 20:53:31] [INFO] 0 followers to notify
[2020-02-23 20:53:31] [INFO] switching to primary monitoring mode
[2020-02-23 20:53:31] [NOTICE] monitoring cluster primary "node2" (ID: 2)
Reference Links:
https://repmgr.org/docs/5.0/index.html
https://repmgr.org/docs/4.4/repmgrd-automatic-failover.html
https://repmgr.org/docs/4.4/repmgrd-basic-configuration.html#REPMGRD-AUTOMATIC-FAILOVER-CONFIGURATION