Building an Immortal Cluster
By now we’ve learned about basic Postgres Docker usage and rudimentary clustering. For the uninitiated, constructing a Postgres cluster can be a daunting task, and we’ve greatly simplified the process. So why don’t we take the next logical step and use Docker to deploy a cluster that is effectively immortal as well?
How is that possible? Why, with repmgr of course! 2ndQuadrant has a tool specifically designed to set up and maintain Postgres clusters. One of the components of repmgr is a daemon that can automatically promote replicas whenever the current primary goes down. Let’s leverage that to make something that’s always online until every node is stopped.
My power is in my own hand
As before, we’re going to need a few scripts to manage the finer points. We can begin by bootstrapping Postgres so repmgr is fully integrated. What are the steps the script should perform? It’s best to keep things isolated, so we need a repmgr user and dedicated database. There’s also a repmgr module we need to install so the daemon works as advertised.
So a basic script might do this:
- Create a repmgr superuser.
- Create a repmgr database.
- Make sure the repmgr user can connect to the repmgr database and the replication stream.
- Install the repmgr library by setting shared_preload_libraries in postgresql.conf.
- Restart Postgres to activate the module.
Here’s one with all of those elements, which we’ll name 01-add_repmgr.sh:
#!/bin/bash
# Skip the bootstrap if pg_hba.conf already has our repmgr entries,
# i.e. the container restarted with an existing data directory.
if [ $(grep -c "replication repmgr" ${PGDATA}/pg_hba.conf) -gt 0 ]; then
    return
fi
# Create the dedicated repmgr superuser and database.
createuser -U "$POSTGRES_USER" -s --replication repmgr
createdb -U "$POSTGRES_USER" -O repmgr repmgr
# Allow the repmgr user to reach the replication stream and its own database.
echo "host replication repmgr all trust" >> ${PGDATA}/pg_hba.conf
echo "host all repmgr all trust" >> ${PGDATA}/pg_hba.conf
# Load the repmgr shared library, then bounce Postgres to activate it.
sed -i "s/#*\(shared_preload_libraries\).*/\1='repmgr'/;" ${PGDATA}/postgresql.conf
pg_ctl -D ${PGDATA} stop -m fast
pg_ctl -D ${PGDATA} start &
sleep 1
There is a bit of a quirk here, in that we stopped Postgres and then started it instead of performing an explicit restart. The entrypoint script in the official Postgres image initializes itself in such a way that on first run, only localhost connections are allowed. Since later scripts need to connect by hostname, starting it this way ensures Postgres listens on all interfaces.
The sleep is necessary because we started Postgres in the background, and the next script needs a tiny sliver of time before it can connect. Beyond that, the script performs exactly the steps we need.
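Incidentally, if the fixed sleep ever proves flaky, a short readiness loop is a reasonable substitute. Here’s a minimal sketch using pg_isready, which ships with Postgres:
until pg_isready -q -h localhost -U "$POSTGRES_USER"; do
    sleep 0.5
done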
People say you’ve had your day
Next we need to actually configure repmgr. The command-line tools and the daemon rely on a configuration file specific to each node. So we need a script that creates that file in a way that works for the initial primary node or any subsequent replica.
That’s a tall order, but we have a few clues as to the cluster state. If this is the first node, it’s automatically the primary. How do we determine that? Why not check for the repmgr.nodes table in the repmgr database? If it’s not the first node, we can simply claim the next node ID. Easy, right?
Here’s how that might work, with a script named 02-conf_repmgr.sh:
#!/bin/bash
# Nothing to do if this node was already configured.
if [ -s /etc/repmgr.conf ]; then
    return
fi
PGHOST=${PRIMARY_NAME}
# If the repmgr.nodes table exists upstream, this isn't the first node,
# so claim the next available node ID; otherwise we're node 1.
installed=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM pg_tables WHERE tablename='nodes'")
my_node=1
if [ "${installed}" == "1" ]; then
    my_node=$(psql -qAt -h ${PGHOST} repmgr -c 'SELECT max(node_id)+1 FROM repmgr.nodes')
fi
cat <<EOF > /etc/repmgr.conf
node_id=${my_node}
node_name=$(hostname -s | sed 's/\W\{1,\}/_/g;')
conninfo=host=$(hostname -f) user=repmgr dbname=repmgr
data_directory=${PGDATA}
pg_bindir=/usr/lib/postgresql/10/bin
use_replication_slots=1
failover=automatic
promote_command=repmgr standby promote
follow_command=repmgr standby follow -W
service_start_command=pg_ctl -D ${PGDATA} start
service_stop_command=pg_ctl -D ${PGDATA} stop -m fast
service_restart_command=pg_ctl -D ${PGDATA} restart -m fast
service_reload_command=pg_ctl -D ${PGDATA} reload
EOF
The configuration itself has a lot of pieces. Besides the node number, each node needs a name, and a connection string other nodes can use for communication. We also need the paths to the appropriate binaries and the data directory. Using replication slots ensures the primary retains WAL until every replica has consumed it, so a lagging replica can always catch up.
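If we want to confirm the slots are doing their job, a quick query against the primary should show one active slot per attached replica. This is just a sanity check, and the container name assumes the cluster we build at the end of this article:
docker exec -it pg-repmgr-1 su -c \
    "psql -qAt -c 'SELECT slot_name, active FROM pg_replication_slots'" - postgres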
The next settings define how the daemon manages automated failover. We’ve elected to use basic repmgr commands to promote the appropriate standby, and inform other standby nodes to follow the new primary. Since these commands could be anything, more advanced use cases could replace these with a sophisticated management script.
And finally, we set all of the start/stop/etc. commands for Postgres itself. These parameters exist to accommodate various environments; if we were using systemd, these would be systemctl commands, for example. They could just as easily be scripts that do far more than simply manipulate the Postgres service. For now though, let’s keep things simple.
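As an illustration, on a systemd-managed host those four settings might look like the following instead. The postgresql@10-main unit name is an assumption that varies by distribution:
service_start_command=sudo systemctl start postgresql@10-main
service_stop_command=sudo systemctl stop postgresql@10-main
service_restart_command=sudo systemctl restart postgresql@10-main
service_reload_command=sudo systemctl reload postgresql@10-main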
Fly the moon and reach for the stars
Configuring repmgr is merely the beginning. Next we need to register the node we just created.
If this is the primary, our job is simple: register the node as the primary. Otherwise, we need to know the previously assigned node ID, and whether or not the cluster is aware of that ID. For brand new nodes, we create a clone of the upstream primary and register it as a new standby. Simple, no?
Here’s a script that implements the above, which we can name 03-register_node.sh:
#!/bin/bash
PGHOST=${PRIMARY_NAME}
# If repmgr.nodes doesn't exist upstream, we're the first node: register as primary.
installed=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM pg_tables WHERE tablename='nodes'")
if [ "${installed}" != "1" ]; then
    repmgr primary register
    return
fi
# Otherwise, recover the node ID we claimed in the previous script and check
# whether the cluster already knows about it.
my_node=$(grep node_id /etc/repmgr.conf | cut -d= -f 2)
is_reg=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM repmgr.nodes WHERE node_id=${my_node}")
# Brand new standby: discard the freshly initialized data directory, clone the
# upstream primary, and register ourselves with the cluster.
if [ "${is_reg}" != "1" ] && [ ${my_node} -gt 1 ]; then
    pg_ctl -D ${PGDATA} stop -m fast
    rm -Rf ${PGDATA}/*
    repmgr -h ${PRIMARY_NAME} -d repmgr standby clone --fast-checkpoint
    pg_ctl -D ${PGDATA} start &
    sleep 1
    repmgr -h ${PRIMARY_NAME} -d repmgr standby register
fi
As with the first script, we need to start Postgres in the background and insert a short sleep so the registration command succeeds. Registration is also the reason we needed to stop/start Postgres in the first script. Whether we call primary register or standby register, repmgr will connect according to the location we specified in conninfo, so Postgres must be listening on the network bridge we set up with Docker.
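To verify that last point once one of the containers we create below is running, a TCP connection to the node’s own fully qualified hostname should succeed, thanks to the trust rules from the first script. A quick sanity check might look like this:
docker exec -it pg-repmgr-1 \
    su -c 'psql -h $(hostname -f) -U repmgr repmgr -c "SELECT 1"' - postgres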
I know that people talk about me – I hear it every day
Finally we need to launch the repmgrd daemon itself. It will run in the foreground of our Docker container and watch the local Postgres, while also communicating with the other repmgrd daemons.
If the primary becomes unavailable, all remaining nodes will hold a vote, and the winner becomes the new primary. It’s a fairly basic process, but the daemon must be running on every node; a node without one will never follow a newly promoted standby. That’s not ideal, as such a node will repeatedly contact the old primary, and could lead to a split-brain scenario if that node doesn’t rejoin the cluster properly.
So we have one final script to launch repmgrd, which we’ll name 04-repmgrd.sh:
#!/bin/bash
# Run repmgrd verbosely in the foreground; since it never exits, it
# effectively becomes the container's main process.
repmgrd -v
Thankfully, there’s not much to explain here. Launching repmgrd this way effectively hijacks the entrypoint from the official Postgres image, so we’ve taken control of the container instead.
Got your world in my hand
Finally, we need to put all of this together with a Dockerfile. Since we’re still inheriting from the latest Postgres image, much of the work is already done. We simply need to install the appropriate repmgr packages, make sure the postgres user can alter the repmgr.conf config file, and install all of the scripts.
Here’s how that might look:
FROM postgres:latest

# Pull repmgr 4.0 for Postgres 10 from the PGDG testing repository.
RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg-testing main 10" \
        >> /etc/apt/sources.list.d/pgdg.list; \
    apt-get update -y; \
    apt-get install -y postgresql-10-repmgr repmgr-common=4.0\*

# Our second script writes this file as the postgres user.
RUN touch /etc/repmgr.conf; \
    chown postgres:postgres /etc/repmgr.conf

# Any container started without PRIMARY_NAME becomes a new primary.
ENV PRIMARY_NAME=localhost

COPY scripts/*.sh /docker-entrypoint-initdb.d/
And for posterity, here is the directory listing for the whole build, including all the scripts we’ve written thus far:
postgres-repmgr
postgres-repmgr/scripts
postgres-repmgr/scripts/01-add_repmgr.sh
postgres-repmgr/scripts/02-conf_repmgr.sh
postgres-repmgr/scripts/03-register_node.sh
postgres-repmgr/scripts/04-repmgrd.sh
postgres-repmgr/Dockerfile
One lingering question might be why we prefixed each script with a number. The Postgres image entrypoint executes those scripts in sorted order, so the numbering guarantees they run in exactly the sequence specified. We can’t register a node before the configuration file exists, for instance.
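Since the entrypoint simply iterates over that directory, a throwaway container can confirm the sequence at a glance once the image is built:
docker run --rm postgres-repmgr ls /docker-entrypoint-initdb.d/
01-add_repmgr.sh
02-conf_repmgr.sh
03-register_node.sh
04-repmgrd.sh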
Let’s build the image we spent all this time preparing:
docker build --tag postgres-repmgr postgres-repmgr
Sending build context to Docker daemon 7.68kB
Step 1/5 : FROM postgres:latest
---> 4860bdf1a517
Step 2/5 : RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg-testing main 10" >> /etc/apt/sources.list.d/pgdg.list; apt-get update -y; apt-get install -y postgresql-10-repmgr repmgr-common=4.0\*
---> Using cache
---> 97b900b9c6cd
Step 3/5 : RUN touch /etc/repmgr.conf; chown postgres:postgres /etc/repmgr.conf
---> Using cache
---> a6ae2bf5b025
Step 4/5 : ENV PRIMARY_NAME localhost
---> Using cache
---> 0276d345bd34
Step 5/5 : COPY scripts/*.sh /docker-entrypoint-initdb.d/
---> Using cache
---> af7071c7b600
Successfully built af7071c7b600
Successfully tagged postgres-repmgr:latest
Born to be kings, princes of the universe
What else is left but to create a cluster? In the last article, we used PRIMARY_NAME as a way of informing each replica of the appropriate upstream system. Some readers may have noticed we did something similar this time around. Except now, any system started without this parameter essentially becomes the primary for a new cluster.
That means if we want a 3-node system, we start a single instance without PRIMARY_NAME, and two that follow the container we just created. Like this:
docker run --name pg-repmgr-1 --network pg_stream -d postgres-repmgr
docker run --name pg-repmgr-2 --network pg_stream \
-e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr
docker run --name pg-repmgr-3 --network pg_stream \
-e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr
At this point we have three containers, and the hope is that they form a cluster as promised. We can test that assertion by connecting to any container and invoking repmgr cluster show to view the current status of the whole cluster.
Check it out:
docker exec -it pg-repmgr-1 su -c "repmgr cluster show" - postgres
ID | Name | Role | Status | Upstream | Location | Connection string
----+--------------+---------+-----------+--------------+----------+---------------------------------------------
1 | 567c8bd22c8f | primary | * running | | default | host=567c8bd22c8f user=repmgr dbname=repmgr
2 | bbe46881180a | standby | running | 567c8bd22c8f | default | host=bbe46881180a user=repmgr dbname=repmgr
3 | dbf05ed46e41 | standby | running | 567c8bd22c8f | default | host=dbf05ed46e41 user=repmgr dbname=repmgr
That’s pretty amazing. We now have a Postgres cluster with a management utility and a failover automation daemon, and the whole thing was easy to deploy. And we’re still not done!
Here we belong, fighting for survival
Now let’s do something entirely unethical, and kill the primary node. If we wait about a minute with the default settings, the cluster should promote either node 2 or 3. Since we stopped node 1, we need to get the cluster status from node 2 or 3 as well, but that’s not a problem for repmgr.
docker stop pg-repmgr-1
sleep 70
docker exec -it pg-repmgr-2 su -c "repmgr cluster show" - postgres
ID | Name | Role | Status | Upstream | Location | Connection string
----+--------------+---------+-----------+--------------+----------+---------------------------------------------
1 | 567c8bd22c8f | primary | - failed | | default | host=567c8bd22c8f user=repmgr dbname=repmgr
2 | bbe46881180a | standby | running | dbf05ed46e41 | default | host=bbe46881180a user=repmgr dbname=repmgr
3 | dbf05ed46e41 | primary | * running | | default | host=dbf05ed46e41 user=repmgr dbname=repmgr
Now we can see that in this case, node 3 promoted itself to be the new primary, and node 2 automatically followed it. Can’t ask for much more than that.
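For the curious, repmgr also records these transitions in its own event table, so any surviving node can replay the election trail with the cluster event command:
docker exec -it pg-repmgr-2 su -c "repmgr cluster event" - postgres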
In a world with the darkest powers
Or can we?
One unfortunate reality with Docker containers is that they’re restricted to the ecosystem provided by Docker. While it’s great that our cluster can keep itself running at all times, how do we contact the primary node?
Remember how we can specify an arbitrary command or script to promote a node? Normally that script could promote the local node and then move a predetermined virtual IP resource. Yet Docker doesn’t really have a mechanism that allows an individual container to claim a virtual IP address. So at this point, if an application wanted to use this cluster, it would have to intermittently poll the cluster state to find the current primary.
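Here’s a minimal sketch of that polling approach, assuming the three containers created earlier. It simply asks each node whether it’s in recovery and reports the one that isn’t:
#!/bin/bash
# Hypothetical helper: find the current primary by elimination.
for node in pg-repmgr-1 pg-repmgr-2 pg-repmgr-3; do
    state=$(docker exec ${node} \
        su -c "psql -qAt -c 'SELECT pg_is_in_recovery()'" - postgres 2>/dev/null)
    # A running primary reports "f" (not in recovery).
    if [ "${state}" == "f" ]; then
        echo "Current primary: ${node}"
    fi
done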
This also makes it difficult to map ports to the local host system for outside connectivity. Each container would need to map a separate port, which makes auto-provisioning exceedingly awkward. Afterwards, do we connect to 5555, 5556, 5557, ad infinitum until a node responds?
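To make that concrete, exposing the cluster would mean assigning every node its own host port up front, and clients still have no idea which one is the primary. The port numbers here are purely illustrative:
docker run --name pg-repmgr-1 --network pg_stream -p 5555:5432 -d postgres-repmgr
docker run --name pg-repmgr-2 --network pg_stream -p 5556:5432 \
    -e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr
docker run --name pg-repmgr-3 --network pg_stream -p 5557:5432 \
    -e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr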
While only cosmetic, we’re also stuck with those ugly generated container names. Though we can explicitly name our containers and refer to those names on the network bridge, Docker provides no way for a container to discover its own “nice” name. Thus we can’t use it in the /etc/repmgr.conf file that determines the cluster configuration.
And finally, we’re in the embarrassing situation where crashed nodes are permanently excluded from the cluster unless we manually connect to the node and do this:
repmgr -h [upstream-master] -d repmgr node rejoin
It would be nice if our startup scripts noticed that situation and included some self-healing. Such a modification is entirely possible, it’s just that the required logic is somewhat beyond the scope of this particular article. We wanted to keep things simple, after all.
We’ve come to be the rulers of you all
Future parts of this series will address at least some of these complaints. For now, we have a handy little toy that lets us build up and tear down entire repmgr clusters at will. We can introspect the cluster for the current primary node, and with that information, we can interact with the cluster as a whole.
Buckle up in any case, because things are about to get a lot more interesting.