PG Phriday: Getting RAD with Docker [Part 3]

November 24, 2017

Building an Immortal Cluster

By now we’ve learned about basic Postgres Docker usage and rudimentary clustering. For the uninitiated, constructing a Postgres cluster can be a daunting task, and we’ve greatly simplified the process. So why don’t we take the next logical step and use Docker to deploy a cluster that is effectively immortal as well? 

How is that possible? Why, with repmgr of course! 2ndQuadrant has a tool specifically designed to set up and maintain Postgres clusters. One of the components of repmgr is a daemon that can automatically promote replicas whenever the current primary goes down. Let’s leverage that to make something that’s always online until every node is stopped.

My power is in my own hand

As before, we’re going to need a few scripts to manage the finer points. We can begin by bootstrapping Postgres so repmgr is fully integrated. What are the steps the script should perform? It’s best to keep things isolated, so we need a repmgr user and dedicated database. There’s also a repmgr module we need to install so the daemon works as advertised.

So a basic script might do this:

  1. Create a repmgr superuser.
  2. Create a repmgr database.
  3. Make sure the repmgr user can connect to the repmgr database and the replication stream.
  4. Install the repmgr library by setting shared_preload_libraries in postgresql.conf.
  5. Restart Postgres to activate the module.

Here’s one with all of those elements, which we’ll name 01-add_repmgr.sh:

#!/bin/bash

if [ $(grep -c "replication repmgr" ${PGDATA}/pg_hba.conf) -gt 0 ]; then
    return
fi

createuser -U "$POSTGRES_USER" -s --replication repmgr
createdb -U "$POSTGRES_USER" -O repmgr repmgr

echo "host replication repmgr all trust" >> ${PGDATA}/pg_hba.conf
echo "host all repmgr all trust" >> ${PGDATA}/pg_hba.conf

sed -i "s/#*\(shared_preload_libraries\).*/\1='repmgr'/;" ${PGDATA}/postgresql.conf

pg_ctl -D ${PGDATA} stop -m fast
pg_ctl -D ${PGDATA} start &

sleep 1

There is a bit of a quirk here, in that we stopped Postgres and then started it instead of performing an explicit restart. The entrypoint script in the official Postgres image initializes itself in such a way that on first run, only localhost connections are allowed. We’ll need to connect to the hostname in later scripts, so we started it this way so it listens on all interfaces.

The sleep is necessary because we started Postgres in the background, and we need a tiny sliver of time before the next script runs. Beyond that, it performs exactly the steps we need.

People say you’ve had your day

Next we need to actually configure repmgr. The command-line tools and the daemon rely on a configuration file which is specific to each node. So we need a script that creates that configuration file, and does so in a way that it works for the initial primary node, or any subsequent replica.

That’s a tall order, but we have a few clues as to the cluster state. If this is the first node, it’s automatically the primary. How do we determine that? Why not check for the repmgr.nodes table in the repmgr database? If it’s not the first node, we can simply claim the next node ID. Easy, right?

Here’s how that might work, with a script named 02-conf_repmgr.sh:

#!/bin/bash

if [ -s /etc/repmgr.conf ]; then
    return
fi

PGHOST=${PRIMARY_NAME}

installed=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM pg_tables WHERE tablename='nodes'")
my_node=1

if [ "${installed}" == "1" ]; then
    my_node=$(psql -qAt -h ${PGHOST} repmgr -c 'SELECT max(node_id)+1 FROM repmgr.nodes')
fi

cat</etc/repmgr.conf
node_id=${my_node}
node_name=$(hostname -s | sed 's/\W\{1,\}/_/g;')
conninfo=host=$(hostname -f) user=repmgr dbname=repmgr
data_directory=${PGDATA}

pg_bindir=/usr/lib/postgresql/10/bin
use_replication_slots=1

failover=automatic
promote_command=repmgr standby promote
follow_command=repmgr standby follow -W

service_start_command=pg_ctl -D ${PGDATA} start
service_stop_command=pg_ctl -D ${PGDATA} stop -m fast
service_restart_command=pg_ctl -D ${PGDATA} restart -m fast
service_reload_command=pg_ctl -D ${PGDATA} reload
EOF

The configuration itself has a lot of pieces. Beside the node number, each node should have a name, and a connection string other nodes can use for communication. We also need to know basic paths to the appropriate binaries and the data directory. The use of replication slots prevents replicas from falling behind.

The next settings define how the daemon manages automated failover. We’ve elected to use basic repmgr commands to promote the appropriate standby, and inform other standby nodes to follow the new primary. Since these commands could be anything, more advanced use cases could replace these with a sophisticated management script.

And finally, we set all of the start/stop/etc commands for Postgres itself. These parameters are here to accommodate various environments. If we were using systemd, these would be systemctl commands, for example. These too could be subsequent scripts that do far more than simply manipulate the Postgres service. For now though, let’s keep things simple.

Fly the moon and reach for the stars

Configuring repmgr is merely the beginning. Next we need to register the node we just created.

If this is the primary, our job is simple: register the node as the primary. Otherwise, we need to know the previously assigned node ID, and whether or not the cluster is aware of that ID. For brand new nodes, we create a clone of the upstream primary and register it as a new standby. Simple, no?

Here’s a script that implements the above, which we can name 03-register_node.sh:

#!/bin/bash

PGHOST=${PRIMARY_NAME}

installed=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM pg_tables WHERE tablename='nodes'")

if [ "${installed}" != "1" ]; then
    repmgr primary register
    return
fi

my_node=$(grep node_id /etc/repmgr.conf | cut -d= -f 2)
is_reg=$(psql -qAt -h ${PGHOST} repmgr -c "SELECT 1 FROM repmgr.nodes WHERE node_id=${my_node}")

if [ "${is_reg}" != "1" ] && [ ${my_node} -gt 1 ]; then
    pg_ctl -D ${PGDATA} stop -m fast
    rm -Rf ${PGDATA}/*
    repmgr -h ${PRIMARY_NAME} -d repmgr standby clone --fast-checkpoint
    pg_ctl -D ${PGDATA} start &
    sleep 1
    repmgr -h ${PRIMARY_NAME} -d repmgr standby register    
fi

As with the first script, we need to start Postgres in the background and insert a short sleep so the registration command succeeds. Registration is also the reason we needed to stop/start Postgres in the first script. Whether we call primary register or standby register, repmgr will connect according to the location we specified in conninfo, so Postgres must be listening on the network bridge we set up with Docker.

I know that people talk about me – I hear it every day

Finally we need to launch the repmgrd daemon itself. It will run in the foreground of our Docker container and watch the local Postgres, while also communicating with the other repmgrd daemons.

If the primary becomes unavailable, all remaining nodes will hold a vote, and the winner becomes the new primary. It’s a fairly basic process, but the daemon must be running or the node without a daemon will not follow a newly promoted standby. That’s not ideal, as the node will continue to repeatedly contact the old primary, and could lead to a split-brain scenario if that node doesn’t re-join the cluster properly.

So we have one final script to launch repmgrd, which we’ll name 04-repmgrd.sh:

#!/bin/bash

repmgrd -v 

Thankfully, there’s not much to explain here. Launching repmgrd this way effectively hijacks the entrypoint from the official Postgres image, so that we’ve taken control of the container instead.

Got your world in my hand

Finally, we need to put all of this together with a Dockerfile. Since we’re still inheriting from the latest Postgres image, much of the work is already done. We simply need to install the appropriate repmgr packages, make sure the postgres user can alter the repmgr.conf config file, and install all of the scripts.

Here’s how that might look:

FROM postgres:latest

RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg-testing main 10" \
          >> /etc/apt/sources.list.d/pgdg.list; \
    apt-get update -y; \
    apt-get install -y postgresql-10-repmgr repmgr-common=4.0\*

RUN touch /etc/repmgr.conf; \
    chown postgres:postgres /etc/repmgr.conf

ENV PRIMARY_NAME=localhost

COPY scripts/*.sh /docker-entrypoint-initdb.d/

And for posterity, here is the directory listing for the whole build, including all the scripts we’ve written thus far:

postgres-repmgr
postgres-repmgr/scripts
postgres-repmgr/scripts/01-add_repmgr.sh
postgres-repmgr/scripts/02-conf_repmgr.sh
postgres-repmgr/scripts/03-register_node.sh
postgres-repmgr/scripts/04-repmgrd.sh
postgres-repmgr/Dockerfile

If there’s a lingering question, it could be to ask why we prepended each script with a number. The way the Postgres image entrypoint script works, this will ensure our scripts run in exactly the order specified. We can’t register a node before the configuration file exists, for instance.

Let’s build the image we spent all this time preparing:

docker build --tag postgres-repmgr postgres-repmgr

Sending build context to Docker daemon   7.68kB
Step 1/5 : FROM postgres:latest
 ---> 4860bdf1a517
Step 2/5 : RUN echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg-testing main 10"           >> /etc/apt/sources.list.d/pgdg.list;     apt-get update -y;     apt-get install -y postgresql-10-repmgr repmgr-common=4.0\*
 ---> Using cache
 ---> 97b900b9c6cd
Step 3/5 : RUN touch /etc/repmgr.conf;     chown postgres:postgres /etc/repmgr.conf
 ---> Using cache
 ---> a6ae2bf5b025
Step 4/5 : ENV PRIMARY_NAME localhost
 ---> Using cache
 ---> 0276d345bd34
Step 5/5 : COPY scripts/*.sh /docker-entrypoint-initdb.d/
 ---> Using cache
 ---> af7071c7b600
Successfully built af7071c7b600
Successfully tagged postgres-repmgr:latest

Born to be kings, princes of the universe

What else is left but to create a cluster? In the last article, we used PRIMARY_NAME as a way of informing each replica of the appropriate upstream system. Some readers may have noticed we did something similar this time around. Except now, any system started without this parameter essentially becomes the primary for a new cluster.

That means if we want a 3-node system, we start a single instance without PRIMARY_NAME, and two that follow the container we just created. Like this:

docker run --name pg-repmgr-1 --network pg_stream -d postgres-repmgr
docker run --name pg-repmgr-2 --network pg_stream \
           -e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr
docker run --name pg-repmgr-3 --network pg_stream \
           -e PRIMARY_NAME=pg-repmgr-1 -d postgres-repmgr

At this point we have three images, and the hope is that they create a cluster as promised. We can test that assertion by connecting to any container and invoking repmgr cluster show to view the current status of the whole cluster.

Check it out:

docker exec -it pg-repmgr-1 su -c "repmgr cluster show" - postgres

 ID | Name         | Role    | Status    | Upstream     | Location | Connection string                          
----+--------------+---------+-----------+--------------+----------+---------------------------------------------
 1  | 567c8bd22c8f | primary | * running |              | default  | host=567c8bd22c8f user=repmgr dbname=repmgr
 2  | bbe46881180a | standby |   running | 567c8bd22c8f | default  | host=bbe46881180a user=repmgr dbname=repmgr
 3  | dbf05ed46e41 | standby |   running | 567c8bd22c8f | default  | host=dbf05ed46e41 user=repmgr dbname=repmgr

That’s pretty amazing. We now have a Postgres cluster with a management utility, a failover automation daemon, and it was easy to deploy. And we’re still not done!

Here we belong, fighting for survival

Now let’s do something entirely unethical, and kill the primary node. If we wait about a minute with the default settings, the cluster should promote either node 2 or 3. Since we stopped node 1, we need to get the cluster status from node 2 or 3 as well, but that’s not a problem for repmgr.

docker stop pg-repmgr-1
sleep 70
docker exec -it pg-repmgr-2 su -c "repmgr cluster show" - postgres

 ID | Name         | Role    | Status    | Upstream     | Location | Connection string                          
----+--------------+---------+-----------+--------------+----------+---------------------------------------------
 1  | 567c8bd22c8f | primary | - failed  |              | default  | host=567c8bd22c8f user=repmgr dbname=repmgr
 2  | bbe46881180a | standby |   running | dbf05ed46e41 | default  | host=bbe46881180a user=repmgr dbname=repmgr
 3  | dbf05ed46e41 | primary | * running |              | default  | host=dbf05ed46e41 user=repmgr dbname=repmgr

Now we can see that in this case, node 3 promoted itself to be the new primary, and node 2 automatically followed it. Can’t as for much more than that.

In a world with the darkest powers

Or can we?

One unfortunately reality with Docker containers is that they’re restricted to the ecosystem provided by Docker. While it’s great our cluster can keep itself running at all times, how do we contact the primary node?

Remember how we can specify an arbitrary command or script to promote a node? Normally that script could promote the local node, and then move a predetermined virtual IP resource. Yet Docker doesn’t really have a system that allows an individual container to claim a virtual IP address. So at this point, if an application wanted to use this cluster, it would have to poll the state of the cluster intermittently to find the current primary.

This also makes it difficult to map ports to the local host system for outside connectivity. Each container would need to map a separate port, which makes auto-provisioning exceedingly awkward. Afterwards, do we connect to 5555, 5556, 5557, ad infinitum until a node responds?

While only cosmetic, we’re also stuck with those ugly generated container strings. Though we can artificially name our containers and refer to those names on the network bridge, Docker provides no way to find the “nice” name of the current container. Thus we can’t use it in the /etc/repmgr.conf file that determines the cluster configuration.

And finally, we’re in the embarrassing situation where crashed nodes are permanently secluded from the cluster unless we manually connect to the node and do this:

repmgr -h [upstream-master] -d repmgr node rejoin

It would be nice if our startup scripts noticed that situation and included some self-healing. Such a modification is entirely possible, it’s just that the required logic is somewhat beyond the scope of this particular article. We wanted to keep things simple, after all.

We’ve come to be the rulers of you all

Future parts of this series will address at least some of these complaints. For now, we have a handy little toy that lets us build up and tear down entire repmgr clusters at will. We can introspect the cluster for the the current primary node, and with that information, we can interact with the cluster as a whole.

Buckle up in any case, because things are about to get a lot more interesting.

Share this

Relevant Blogs

Random Data

This post continues from my report on Random Numbers. I have begun working on a random data generator so I want to run some tests to see whether different random...
December 03, 2020

More Blogs

Full-text search since PostgreSQL 8.3

Welcome to the third – and last – part of this blog series, exploring how the PostgreSQL performance evolved over the years. The first part looked at OLTP workloads, represented...
November 05, 2020

Números aleatorios

He estado trabajando gradualmente en el desarrollo desde cero de herramientas para probar el rendimiento de los sistemas de bases de datos de código abierto. Uno de los componentes de...
November 04, 2020