Fans of Rapid Application Development (RAD!) are probably already familiar with Docker, but what does that have to do with Postgres? Database-driven applications are a dime a dozen these days, and a good RAD environment is something of a Holy Grail to coders and QA departments alike. Docker lets us spin up a Postgres instance in seconds, and discard it with a clean conscience.
There have even been some noises within certain circles about using it in a production context. Can we do something like that responsibly? Docker containers are practically guaranteed to be ephemeral, while production data most decidedly isn’t. The answer to this is ultimately complex, and something we’ll be exploring over the next several weeks.
Let’s get started.
Let There Be Light
Since Docker itself is a commercial product, getting the latest community packages (Docker CE) means making a free account on their website and downloading the packages for your preferred platform. For slightly older versions, RHEL and CentOS variants can install EPEL, where the docker-io package resides. Debian variants such as Ubuntu can simply install docker.io with apt.
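As a rough sketch, and assuming the package names above haven’t moved since this was written, installation boils down to one or two commands on either family of distributions:
$> sudo apt-get install docker.io      # Debian / Ubuntu
$> sudo yum install epel-release       # RHEL / CentOS, to enable EPEL first
$> sudo yum install docker-io
After that, starting the docker service through systemctl (or the init system of your choice) is all that’s left.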
Once we have Docker itself, we can actually start playing with it right away. We should start by grabbing the latest version of Postgres from the official Docker Hub image:
$> docker pull postgres:latest
latest: Pulling from library/postgres
3e17c6eae66c: Pull complete
3d89ae9a47e4: Pull complete
f7726fda7efe: Pull complete
d1838499bd8f: Pull complete
a5ec5aa60735: Pull complete
1571d7170291: Pull complete
0d6e41e13732: Pull complete
787e3c45a9bb: Pull complete
7b234cf83b22: Pull complete
3a8ad2440289: Pull complete
9351993374c0: Pull complete
a8f3575e09a1: Pull complete
a4c4b2ff0c3a: Pull complete
Digest: sha256:73a1c4e98fb961bb4a5c55ad6428470a3303bd3966abc442fe937814f6bbc002
Status: Downloaded newer image for postgres:latest
$> docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
postgres            latest              b106d5a0dc75        22 hours ago        287.2 MB
Since Docker images can build on top of each other, we can see that the Postgres image is composed of a long chain of layers. Despite that, the image itself isn’t very large; 287MB for an entire operating system and database server is pretty good. There are even smaller versions if we’re willing to forgo certain UNIX tools. The postgres:10-alpine image, for instance, is just under 40MB.
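We won’t use the Alpine variant in this article, but checking the size difference ourselves is a quick exercise; the exact numbers will drift from release to release:
$> docker pull postgres:10-alpine
$> docker images postgres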
Runnin’ with the Devil
Just having an image is pretty pointless, of course. Let’s start a container and see what’s inside. This is done with the docker run command. We’ll pass it a few parameters to make things easier to use later as well. Setting the name, for example, lets us manipulate the container later without having to use an unwieldy hexadecimal ID.
Since Docker containers run on a separate network layer, we want to publish the default Postgres port (5432) as 5555 on the host to prevent conflicts with existing applications on the host system. And finally, we want to detach from the terminal so we can use it for other things, which is also probably how we’d run the container in any reasonable configuration.
$> docker run --name pg-test -p 5555:5432 -d postgres
69f21490ff415e485501cd9c3ed04336d7c4147f77025b7de8dbf565b11fdf52
$> docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
69f21490ff41        postgres            "docker-entrypoint.sh"   40 seconds ago      Up 39 seconds       0.0.0.0:5555->5432/tcp   pg-test
$> psql -p 5555 -U postgres -h localhost
Timing is on.
Pager usage is off.
psql (10.0)
Type "help" for help.
postgres=# SELECT version();
version
--------------------------------------------------------------------------------------------------
PostgreSQL 10.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18) 6.3.0 20170516, 64-bit
(1 row)
Run and done. We can connect to the container on port 5555 as expected, we have the latest Postgres 10, and Postgres confirms its full version and build environment. Let’s try to make some tables.
Playing God
We’ve used the sensor_log table a lot in the past. A good reason for that is the sheer size of the data we normally insert. There’s no reason to stop now, is there? Let’s keep our connection and do a couple of basic timings:
\timing on
CREATE TABLE sensor_log (
sensor_log_id SERIAL PRIMARY KEY,
location VARCHAR NOT NULL,
reading BIGINT NOT NULL,
reading_date TIMESTAMP NOT NULL
);
Time: 7.945 ms
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
FROM generate_series(1, 5000000) s(id);
Time: 30981.570 ms (00:30.982)
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
Time: 9339.747 ms (00:09.340)
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
Time: 3689.406 ms (00:03.689)
ANALYZE sensor_log;
Time: 247.532 ms
SELECT count(*) FROM sensor_log;
count
---------
5000000
Time: 518.908 ms
SELECT count(*) FROM sensor_log
WHERE location = '50';
count
-------
5000
Time: 8.243 ms
SELECT count(*) FROM sensor_log
WHERE reading_date BETWEEN '2016-01-01' AND '2016-06-01';
count
--------
680001
(1 row)
Time: 139.237 ms
One weakness of this approach is that the Postgres data exists entirely within this single container. Since containers are intended to be temporary, that’s not great if we want to use the container as a somewhat bulky linked binary. We want to be able to kill—and even erase—our container in some cases, yet retain the data.
How do we manage that?
Sharing the Wealth
One way to do this with Docker is volumes. We can build a volume and start another container that uses it like this:
$> docker volume create --name pgdata
$> docker run --name pg-mount -p 5600:5432 -d \
--volume pgdata:/var/lib/postgresql/data postgres
We created a pgdata volume, then told Docker that the volume should be mounted at /var/lib/postgresql/data. This is the default data location used by the maintainers of the official image, so a new container will use the external volume instead of the built-in storage.
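If we’re curious where that volume actually lives on the host, Docker will tell us. The mountpoint usually ends up somewhere under /var/lib/docker/volumes unless the daemon was configured otherwise:
$> docker volume inspect pgdata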
Let’s connect to this new container, make a very small proof-of-concept table, then stop and erase the container. Then we can create a new container that uses the same volume, and see if the table is still there.
$> psql -p 5600 -U postgres -h localhost \
-c "CREATE TABLE foo (id SERIAL);"
$> docker stop pg-mount
$> docker rm pg-mount
$> docker run --name pg-new-mount -p 5600:5432 -d \
--volume pgdata:/var/lib/postgresql/data postgres
$> psql -p 5600 -U postgres -h localhost -c "\d foo"
                            Table "public.foo"
 Column |  Type   | Collation | Nullable |             Default
--------+---------+-----------+----------+---------------------------------
 id     | integer |           | not null | nextval('foo_id_seq'::regclass)
Nice, eh? Now we can test various container variants, or image builds, against the same data volume.
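As a quick sketch, nothing stops us from pointing the Alpine variant at the very same volume. Just remember that the data directory must have been initialized by the same major version, and that hopping between glibc and musl based images can upset collation-dependent indexes, so treat this strictly as a playground trick:
$> docker stop pg-new-mount
$> docker rm pg-new-mount
$> docker run --name pg-alpine -p 5600:5432 -d \
   --volume pgdata:/var/lib/postgresql/data postgres:10-alpine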
Starting from Scratch
But what about performance? Docker containers rely on copy-on-write (CoW) semantics on top of a layered storage driver. Is it possible that this abstraction layer could be interfering with our desired use of the hardware?
One of the handy things Docker can do is mount locations on the host machine. Let’s start another image by mapping to a drive on the Docker host, and see if the timings change at all.
We start by launching a new Docker container with the desired mount. The cool thing about the --volume flag is that it will create the directory for us if it doesn’t already exist. It’ll even give it the correct permissions, since Docker itself runs as a management daemon under the root user, and not the postgres user we’re probably accustomed to.
$> docker run --name pg-local -p 5700:5432 -d \
--volume /db/docker:/var/lib/postgresql/data postgres
$> ls -l /db
drwx------ 19 vboxadd root 4096 Oct 13 17:09 docker
Now let’s connect to port 5700 and repeat our sensor_log timings:
\timing on
CREATE TABLE sensor_log (
sensor_log_id SERIAL PRIMARY KEY,
location VARCHAR NOT NULL,
reading BIGINT NOT NULL,
reading_date TIMESTAMP NOT NULL
);
Time: 7.977 ms
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
FROM generate_series(1, 5000000) s(id);
Time: 30180.467 ms (00:30.180)
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
Time: 9279.165 ms (00:09.279)
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
Time: 3268.765 ms (00:03.269)
ANALYZE sensor_log;
Time: 205.313 ms
SELECT count(*) FROM sensor_log;
count
---------
5000000
Time: 483.312 ms
SELECT count(*) FROM sensor_log
WHERE location = '50';
count
-------
5000
Time: 7.234 ms
SELECT count(*) FROM sensor_log
WHERE reading_date BETWEEN '2016-01-01' AND '2016-06-01';
count
--------
680001
(1 row)
Time: 141.441 ms
It turns out the filesystem mapping is comparable to Docker’s own filesystem, as the differences are statistically insignificant. Of course, now instead of using its own internal filesystem layer, Docker must maintain a mapping to the local filesystem, so it’s hard to say what’s actually going on behind the scenes. We’re also operating in a relatively puny virtual environment with its own storage performance limitations.
Still, a commendable result, as it leaves us the option to use standard filesystem paths. LVM? Sure. XFS? Why not. An external SAN device composed entirely of SSDs? Of course. There are interesting implications here.
Out of Eden
So are containers really just a bloated linked Postgres binary, since we could have just managed a local directory with a standard Postgres install? Yes and no. We’ll definitely explore more of these concepts in later articles, but it’s important to note that containers are meant for chaining and automation.
A standard Postgres install has a certain amount of administration involved. Maybe it’s just the configuration file, or user creation, or package maintenance. There’s a lot that goes into building a Postgres server. Since each container is itself a tiny, pre-built Postgres server, we don’t really need to do any of that in simplified contexts.
Did the official Postgres package see an update? We can just update to the latest image, stop the existing container, and start a new one. The old container is still there, so reverting is trivial.
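Here’s a sketch of that upgrade dance, reusing the pg-new-mount container and pgdata volume from earlier, and assuming the update is a minor release, since major versions change the on-disk format:
$> docker pull postgres:latest
$> docker stop pg-new-mount
$> docker run --name pg-updated -p 5600:5432 -d \
   --volume pgdata:/var/lib/postgresql/data postgres:latest
If anything looks wrong, stopping pg-updated and issuing docker start pg-new-mount puts everything back exactly as it was.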
We can make our own image with contrib, and plpython, and pglogical, and launch that anywhere and everywhere. We can hand it to developers that don’t know anything about Postgres, and they can have their own database server that mirrors the production environment. And we can do it without an army of resource-intensive and difficult to provision VMs.
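A minimal sketch of such a custom image might look like this; the extension package names are placeholders that depend on which apt repositories are reachable during the build, and contrib already ships with the official image:
$> docker build -t my-postgres:10 - <<'EOF'
FROM postgres:10
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    postgresql-plpython-10 postgresql-10-pglogical \
 && rm -rf /var/lib/apt/lists/*
EOF
$> docker run --name pg-custom -p 5800:5432 -d my-postgres:10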
This is just the very tiniest tip of the proverbial iceberg. We’ll continue to build our toolkit in future articles, and maybe end up with something that isn’t arguably worse than a standard server environment, just different.