In our last article, we explored how to run Postgres in some very basic Docker scenarios. Based on our experiments back then, we can obtain images, create containers, and mount various types of storage.
Boring!
It’s not just boring, it’s mundane. It doesn’t do anything. Sure we can run Postgres in a container, but that’s true about a lot of things. You know what’s exciting? Setting up Postgres streaming replication between two Docker containers.
Let’s get started.
He say “I know you, you know me”
The first thing we need to do is create a container that will be our primary host. Postgres requires a user with REPLICATION
permission, as well as a special entry in the pg_hba.conf
file for the “replication” pseudo-database. We could start a regular Postgres container, connect to it, and set all of that up ourselves, but it’s much cooler to create our own image which does that for us.
The Postgres Docker Hub image has excellent instructions for extending the official image, so let’s start there.
Since we know the two modifications we need to make, let’s create a basic script to automate the steps:
#!/bin/bash

if [ $(grep -c "replication repuser" ${PGDATA}/pg_hba.conf) -gt 0 ]; then
    exit 0
fi

psql -U "$POSTGRES_USER" \
     -c "CREATE USER repuser WITH REPLICATION" postgres

echo "host replication repuser all trust" >> ${PGDATA}/pg_hba.conf

pg_ctl -D ${PGDATA} reload
Since we could be attaching to a pre-existing Postgres volume, we need to check whether or not we already modified it so the repuser
user could connect. Aside from that, all we do is create the user and add a very insecure host entry so it can connect to initiate streaming replication.
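If the blanket trust entry makes you wince, one slightly less reckless variant scopes it to the Docker bridge subnet instead of all. The 172.18.0.0/16 range below is an assumption based on Docker’s usual bridge addressing; check docker network inspect for the actual range on your host:

```text
# pg_hba.conf: still trust, but only from the bridge network
# (172.18.0.0/16 is an assumed range; verify with docker network inspect)
host    replication    repuser    172.18.0.0/16    trust
```

For anything beyond a demo, you’d also give repuser a password and switch trust to an actual auth method.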
He got hair down to his knee
The next step is to create a Dockerfile
to define our image. We didn’t do much, so that’s a thankfully easy task. Here’s how that looks:
FROM postgres:latest
COPY scripts/enable_replication.sh /docker-entrypoint-initdb.d
We saved our previous script as scripts/enable_replication.sh
to keep things together. The official Postgres image suggests we can copy that file to a special location and it’ll run automatically. Not only is that convenient, but it keeps us from having to know a lot of Docker syntax at this point.
Here’s the full contents of the folder where we’re constructing our image source:
postgres-primary/
postgres-primary/scripts
postgres-primary/scripts/enable_replication.sh
postgres-primary/Dockerfile
And now we just create our image:
$> docker build --tag postgres-primary postgres-primary
Sending build context to Docker daemon 5.632kB
Step 1/2 : FROM postgres:latest
---> 4860bdf1a517
Step 2/2 : COPY scripts/enable_replication.sh /docker-entrypoint-initdb.d
---> 03935680f9e2
Successfully built 03935680f9e2
Successfully tagged postgres-primary:latest
Got to be a joker he just do what he please
That’s one image out of the way, but a replication stream requires at least one more component: a replica. To build that, we actually have a slight problem because we need to know the hostname of the container we’re cloning. We can remove that roadblock by requiring users to pass an environment variable with that information.
We also need to stop and erase any existing Postgres instance created by the official image, while also refraining from erasing a legitimate replica. It’s a tall order, but this very basic script has all the requisite parts:
#!/bin/bash
set -e

# This is where we tell the invoker to set PRIMARY_NAME.
if [ "${PRIMARY_NAME}" == "unset" ]; then
    echo "ERROR: Please set PRIMARY_NAME to stream source container."
    exit 1
fi

# Don't erase an existing replica. This means we've already bootstrapped.
if [ -f ${PGDATA}/recovery.conf ]; then
    exit 0
fi

pg_ctl -D ${PGDATA} stop -m fast
rm -Rf ${PGDATA}/*

pg_basebackup -U repuser -X stream -R -c fast \
    -h ${PRIMARY_NAME} -D ${PGDATA}

pg_ctl -D ${PGDATA} start
The “trick” to this script is that we look for a recovery.conf
file. That tells us a replica already exists in this instance, and we can skip replacing it with an upstream clone. A more “correct” script might actually check the file to ensure the replica is subscribed to the indicated PRIMARY_NAME
, but this is just a demo so it’s OK to be a little sloppy.
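A sketch of that stricter check might grep the primary_conninfo line that pg_basebackup -R writes into recovery.conf for the expected hostname. The matches_primary helper below is hypothetical, not part of the image; it just shows the shape of the test:

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the existing recovery.conf already
# points at the expected primary host, nonzero otherwise.
matches_primary() {
    local conf_file="$1" primary="$2"
    grep -q "host=${primary}" "${conf_file}"
}

# pg_basebackup -R writes a primary_conninfo line roughly like this:
echo "primary_conninfo = 'user=repuser host=pg-primary port=5432'" \
    > /tmp/recovery.conf

if matches_primary /tmp/recovery.conf pg-primary; then
    echo "replica already follows pg-primary; skipping clone"
fi
```

The bootstrap script could call something like this before deciding to wipe the data directory, falling through to pg_basebackup only on a mismatch.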
He got feet down below his knee
As before, we need another Dockerfile
to actually coordinate the moving parts, such as they are. This time we also need to define our environment variable along with copying the script.
FROM postgres:latest
ENV PRIMARY_NAME=unset
COPY scripts/bootstrap_replica.sh /docker-entrypoint-initdb.d
The ENV
command requires we set the variable to something, so we just use ‘unset’. Beyond that, there’s really not much else to do. All the heavy lifting is still in the official Postgres image.
Here’s the full contents of the folder where we’re constructing our image source:
postgres-replica/
postgres-replica/scripts
postgres-replica/scripts/bootstrap_replica.sh
postgres-replica/Dockerfile
And now we just create our image:
$> docker build --tag postgres-replica postgres-replica
Sending build context to Docker daemon 3.584kB
Step 1/3 : FROM postgres:latest
---> 4860bdf1a517
Step 2/3 : ENV PRIMARY_NAME unset
---> Running in 0a119ba1f6c8
---> 2f389fa04d93
Removing intermediate container 0a119ba1f6c8
Step 3/3 : COPY scripts/bootstrap_replica.sh /docker-entrypoint-initdb.d
---> 90796d5559b7
Successfully built 90796d5559b7
Successfully tagged postgres-replica:latest
He got muddy water, he one mojo filter
One of the nice things about Docker is that it has built-in networking capabilities. Networking can be an unwieldy and cantankerous Rube Goldberg contraption of the highest order, but Docker tucks all that away into some unholy dark recesses where it can’t hurt us.
All we need to know is that we can create a network bridge. Once we have a network bridge, we can assign it to newly launched containers. This gives all containers assigned to that bridge the same network operational context. Nodes are labeled after the container name, making it trivial to reference them later.
Making the bridge itself is pretty darn easy:
$> docker network create --driver bridge pg_stream
d7fbe614030ce14485c219db1f6e23d07a373c1667877dfaa6a5f1e287a51c9a
As with everything Docker, we get a hash designating our network. Now let’s put it to use!
He say, “One and one and one is three”
Now comes the fun part. In our last article, we exposed the network interface to the host machine, but this time we’re going to specifically avoid that. Port mapping is ugly and there are better ways to connect to containers, especially when they can interact in their own network sandbox.
Let’s begin by actually launching the containers:
$> docker run --name pg-primary --network pg_stream -d postgres-primary
96d1a822e436570708e5e3de39ada41c1f07077795cff64f2796fd7afe1aa213
$> docker run --name pg-replica --network pg_stream -d \
-e PRIMARY_NAME=pg-primary postgres-replica
9077a33bf25be4553b037137df5d43fb67f4ece0df4db69cc40cd384bb518df8
Since running a daemonized container also hides the console output, we need some way of verifying those output hashes mean everything worked as expected. As it turns out, we can connect directly to the container of the primary node and outright ask it if there are any streaming replicas connected.
Let’s do that now:
$> docker exec -it pg-primary psql -U postgres -x \
-c "SELECT * FROM pg_stat_replication"
-[ RECORD 1 ]----+------------------------------
pid | 76
usesysid | 16384
usename | repuser
application_name | walreceiver
client_addr | 172.18.0.3
client_hostname |
client_port | 39600
backend_start | 2017-11-05 23:44:01.613458+00
backend_xmin |
state | streaming
sent_lsn | 0/3000060
write_lsn | 0/3000060
flush_lsn | 0/3000060
replay_lsn | 0/3000060
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
Well how about that? One streaming replica. How about another?
$> docker run --name pg-replica2 --network pg_stream -d \
-e PRIMARY_NAME=pg-primary postgres-replica
3e00427e330ab202a3f8ba1c64823c8b67bfaa134da1f3083c8f5caa05fa8241
$> docker exec -it pg-primary psql -U postgres \
-c "SELECT pid, client_addr FROM pg_stat_replication"
pid | client_addr
-----+-------------
76 | 172.18.0.3
97 | 172.18.0.4
Has it ever been easier to clone a Postgres node?
One thing I can tell you is you got to be free
The cool thing about the replica recipe is that it works for any Postgres container. If we followed some of the steps in the last article, we may have created the Postgres data directory on the native filesystem. Maybe that’s even where it started, and we’re just using a temporary container to investigate its contents.
So long as the primary container is running, we can clone it over and over again. The replicas can also live on the native filesystem, and so long as there’s a recovery.conf
file, the replica will run normally. This also means if we promote a replica and restart the container, it will clone upstream again.
If we want to avoid that, we can write a more advanced bootstrap script that also checks for recovery.done
, and if found, applies pg_rewind. That will allow us to follow the primary again without all the overhead of rebuilding the replica from scratch.
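The decision logic for that smarter bootstrap could be sketched like this. The decide_action helper is hypothetical and only echoes what the script should do; a real version would invoke pg_basebackup or pg_rewind where indicated:

```shell
#!/bin/bash
# Hypothetical helper: inspect the data directory and report which
# bootstrap path applies, instead of always recloning.
decide_action() {
    local pgdata="$1"
    if [ -f "${pgdata}/recovery.conf" ]; then
        echo "skip"        # already a working replica; leave it alone
    elif [ -f "${pgdata}/recovery.done" ]; then
        echo "pg_rewind"   # promoted replica; rewind instead of recloning
    else
        echo "basebackup"  # fresh or unknown state; clone from the primary
    fi
}

# Quick demonstration against a throwaway directory:
demo=$(mktemp -d)
touch "${demo}/recovery.done"
decide_action "${demo}"   # prints "pg_rewind"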
There are obviously more safety checks we could apply, but as a proof-of-concept, this is actually slightly ridiculous. Someone armed with these two images can create a primary Postgres node and an army of replicas with minimal knowledge about Postgres itself. It’s a set of black boxes perfect for stamping out a very simple cluster.
You can’t really get more RAD than that.