PG Phriday: Getting RAD with Docker [Part 1]

October 20, 2017

Fans of Rapid Application Development (RAD!) are probably already familiar with Docker, but what does that have to do with Postgres? Database-driven applications are a dime a dozen these days, and a good RAD environment is something of a Holy Grail to coders and QA departments alike. Docker lets us spin up a Postgres instance in seconds, and discard it with a clean conscience.

There have even been some noises within certain circles about using it in a production context. Can we do something like that responsibly? Docker containers are ephemeral by design, while production data most decidedly isn’t. The answer to this is ultimately complex, and something we’ll be exploring over the next several weeks.

Let’s get started.

Let There Be Light

Since Docker itself is a commercial product, getting the latest community packages (Docker CE) means making a free account on their website and downloading the packages for your preferred platform. For slightly older versions, RHEL and CentOS variants can install EPEL, where the docker-io package resides. Debian variants such as Ubuntu can simply install docker.io with apt.
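
For reference, the package route boils down to something like this; exact package names vary by distribution release, so treat these as illustrative:

$> sudo apt-get install docker.io       # Debian / Ubuntu

$> sudo yum install epel-release        # RHEL / CentOS
$> sudo yum install docker-io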

Once we have Docker itself, we can actually start playing with it right away. We should start by grabbing the latest version of Postgres from the official Docker Hub image:

$> docker pull postgres:latest

latest: Pulling from library/postgres
3e17c6eae66c: Pull complete 
3d89ae9a47e4: Pull complete 
f7726fda7efe: Pull complete 
d1838499bd8f: Pull complete 
a5ec5aa60735: Pull complete 
1571d7170291: Pull complete 
0d6e41e13732: Pull complete 
787e3c45a9bb: Pull complete 
7b234cf83b22: Pull complete 
3a8ad2440289: Pull complete 
9351993374c0: Pull complete 
a8f3575e09a1: Pull complete 
a4c4b2ff0c3a: Pull complete 
Digest: sha256:73a1c4e98fb961bb4a5c55ad6428470a3303bd3966abc442fe937814f6bbc002
Status: Downloaded newer image for postgres:latest

$> docker images

REPOSITORY    TAG         IMAGE ID          CREATED           SIZE
postgres      latest      b106d5a0dc75      22 hours ago      287.2 MB

Since Docker images can build on top of each other, we can see that Postgres is composed of a long chain of dependencies. Despite that, the image itself isn’t very large; 287MB for an entire operating system and database server is pretty good. There are even smaller versions if we are willing to forgo certain UNIX tools. The ‘postgres:10-alpine’ image, for instance, is just under 40MB.
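
Pulling one of those slimmer variants works exactly the same way; only the tag changes:

$> docker pull postgres:10-alpine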

Runnin’ with the Devil

Just having an image is pretty pointless, of course. Let’s start a container and see what’s inside. This is done with the docker run command. We’ll pass it a few parameters to make things easier to use later as well. Setting the name, for example, lets us manipulate the container later without having to use an unwieldy container ID hash.

Since Docker containers run on a separate network layer, we want to publish the default Postgres port (5432) as 5555 to prevent conflicts with existing applications on the host system. Finally, we want to detach from the terminal so we can use it for other things; that’s probably how we’d run the container in any reasonable configuration anyway.

$> docker run --name pg-test -p 5555:5432 -d postgres

69f21490ff415e485501cd9c3ed04336d7c4147f77025b7de8dbf565b11fdf52

$> docker ps

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
69f21490ff41        postgres            "docker-entrypoint.sh"   40 seconds ago      Up 39 seconds       0.0.0.0:5555->5432/tcp   pg-test

$> psql -p 5555 -U postgres -h localhost

Timing is on.
Pager usage is off.
psql (10.0)
Type "help" for help.

postgres=# SELECT version();
                                             version                                              
--------------------------------------------------------------------------------------------------
 PostgreSQL 10.0 on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18) 6.3.0 20170516, 64-bit
(1 row)

Run and done. We can connect to the container on port 5555 as expected, we have the latest Postgres 10, and Postgres confirms its full version and build environment. Let’s try to make some tables.

Playing God

We’ve used the sensor_log table a lot in the past. A good reason for that is the sheer volume of data we normally insert. There’s no reason to stop now, is there? Let’s keep our connection open and run a couple of basic timings:

\timing on

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

Time: 7.945 ms

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);

Time: 30981.570 ms (00:30.982)

CREATE INDEX idx_sensor_log_location ON sensor_log (location);

Time: 9339.747 ms (00:09.340)

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Time: 3689.406 ms (00:03.689)

ANALYZE sensor_log;

Time: 247.532 ms

SELECT count(*) FROM sensor_log;

  count  
---------
 5000000

Time: 518.908 ms

SELECT count(*) FROM sensor_log
 WHERE location = '50';

 count 
-------
  5000

Time: 8.243 ms

SELECT count(*) FROM sensor_log
 WHERE reading_date BETWEEN '2016-01-01' AND '2016-06-01';

 count  
--------
 680001
(1 row)

Time: 139.237 ms

One weakness of this approach is that the Postgres data exists entirely within this single container. Since containers are intended to be temporary, that’s not great if we want to use the container as a somewhat bulky linked binary. We want to be able to kill—and even erase—our container in some cases, yet retain the data.

How do we manage that?

Sharing the Wealth

One way to do this with Docker is volumes. We can build a volume and start another container that uses it like this:

$> docker volume create --name pgdata
$> docker run --name pg-mount -p 5600:5432 -d \
              --volume pgdata:/var/lib/postgresql/data postgres

We created a pgdata volume, then told Docker that the volume should be mounted at /var/lib/postgresql/data. This is the default data location used by the maintainers of the official image, so a new container will use the external volume instead of its built-in storage.
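
If we’re curious where that data actually lives, docker volume inspect reports the backing path on the host. The Mountpoint shown here is typical for a default Linux install, though it may differ:

$> docker volume inspect pgdata

[
    {
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/pgdata/_data",
        "Name": "pgdata",
        "Options": {},
        "Scope": "local"
    }
]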

Let’s connect to this new container, make a very small proof-of-concept table, then stop and erase the container. Then we can create a new container that uses the same volume, and see if the table is still there.

$> psql -p 5600 -U postgres -h localhost \
        -c "CREATE TABLE foo (id SERIAL);"

$> docker stop pg-mount
$> docker rm pg-mount
$> docker run --name pg-new-mount -p 5600:5432 -d \
              --volume pgdata:/var/lib/postgresql/data postgres

$> psql -p 5600 -U postgres -h localhost -c "\d foo"

                            Table "public.foo"
 Column |  Type   | Collation | Nullable |             Default             
--------+---------+-----------+----------+---------------------------------
 id     | integer |           | not null | nextval('foo_id_seq'::regclass)

Nice, eh? Now we can test various container variants, or image builds, against the same data volume.
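
The volume also outlives any container that uses it. We can stop and remove every container, and the data stays put until we explicitly destroy the volume itself:

$> docker stop pg-new-mount
$> docker rm pg-new-mount
$> docker volume ls

DRIVER              VOLUME NAME
local               pgdata

$> docker volume rm pgdata    # only if we really mean it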

Starting from Scratch

But what about performance? Docker containers utilize Copy on Write (CoW) semantics on top of a layered union filesystem such as AUFS or OverlayFS. Is it possible that this abstraction layer could be interfering with our desired use of the hardware?
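
Which storage driver is actually in play depends on the platform; docker info will tell us. This host happens to use overlay2, but aufs or devicemapper are just as common:

$> docker info | grep 'Storage Driver'

Storage Driver: overlay2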

One of the handy things Docker can do is mount locations on the host machine. Let’s start another container, mapping its data directory to a path on the Docker host, and see if the timings change at all.

We start by launching a new Docker container with the desired mount. The cool thing about the --volume flag is that it will create the directory for us if it doesn’t already exist. It’ll even give it the correct permissions, since Docker itself runs as a management daemon under the root user, and not the postgres user we’re probably accustomed to.

$> docker run --name pg-local -p 5700:5432 -d \
              --volume /db/docker:/var/lib/postgresql/data postgres

$> ls -l /db

drwx------ 19 vboxadd  root  4096 Oct 13 17:09 docker

Now let’s connect to port 5700 and repeat our sensor_log timings:

\timing on

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

Time: 7.977 ms

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10)::TEXT || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);

Time: 30180.467 ms (00:30.180)

CREATE INDEX idx_sensor_log_location ON sensor_log (location);

Time: 9279.165 ms (00:09.279)

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Time: 3268.765 ms (00:03.269)

ANALYZE sensor_log;

Time: 205.313 ms

SELECT count(*) FROM sensor_log;

  count  
---------
 5000000

Time: 483.312 ms

SELECT count(*) FROM sensor_log
 WHERE location = '50';

 count 
-------
  5000

Time: 7.234 ms

SELECT count(*) FROM sensor_log
 WHERE reading_date BETWEEN '2016-01-01' AND '2016-06-01';

 count  
--------
 680001
(1 row)

Time: 141.441 ms

It turns out the filesystem mapping is comparable to Docker’s own filesystem, as the differences are statistically insignificant. Of course, now instead of using its own internal filesystem layer, Docker must maintain a mapping to the local filesystem, so it’s hard to say what’s actually going on behind the scenes. We’re also operating in a relatively puny virtual environment with its own storage performance limitations.

Still, a commendable result, as it leaves us the option to use standard filesystem paths. LVM? Sure. XFS? Why not. An external SAN device composed entirely of SSDs? Of course. There are interesting implications here.
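
For anyone who wants something more rigorous than our ad-hoc timings, pgbench works against a published container port like any other Postgres instance. A quick sketch, with the scale factor and duration chosen arbitrarily:

$> pgbench -h localhost -p 5700 -U postgres -i -s 50 postgres
$> pgbench -h localhost -p 5700 -U postgres -c 4 -T 60 postgres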

Out of Eden

So are containers really just a bloated linked Postgres binary, since we could have just managed a local directory with a standard Postgres install? Yes and no. We’ll definitely explore more of these concepts in later articles, but it’s important to note that containers are meant for chaining and automation.

A standard Postgres install has a certain amount of administration involved. Maybe it’s just the configuration file, or user creation, or package maintenance. There’s a lot that goes into building a Postgres server. Since a container is effectively a tiny prebuilt Postgres server, we don’t really need to do any of that in simplified contexts.

Did the official Postgres package see an update? We can just update to the latest image, stop the existing container, and start a new one. The old container is still there, so reverting is trivial.
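
The whole cycle is just a handful of commands; here against a hypothetical volume-backed container named pg-prod. One caveat worth stating plainly: this applies to minor releases only, as a new major version still means pg_upgrade or a dump and restore, exactly as with a native install.

$> docker pull postgres:latest
$> docker stop pg-prod
$> docker run --name pg-prod-updated -p 5432:5432 -d \
              --volume pgdata:/var/lib/postgresql/data postgres

Reverting really is trivial: docker stop pg-prod-updated, then docker start pg-prod.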

We can make our own image with contrib, plpython, and pglogical, and launch it anywhere and everywhere. We can hand it to developers who don’t know anything about Postgres, and they can have their own database server that mirrors the production environment. And we can do it without an army of resource-intensive, difficult-to-provision VMs.
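
As a taste of what that might look like, here’s a minimal sketch of such an image. Treat the package names as assumptions: contrib already ships in the official image, and pglogical may require adding its publisher’s repository first.

$> cat > Dockerfile <<'EOF'
# Build on the official image; postgresql-contrib is already included.
FROM postgres:10

# Add plpython and pglogical; package names vary by Postgres version.
RUN apt-get update \
    && apt-get install -y postgresql-plpython-10 postgresql-10-pglogical \
    && rm -rf /var/lib/apt/lists/*
EOF

$> docker build -t pg-custom .
$> docker run --name pg-dev -p 5900:5432 -d pg-custom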

This is just the very tiniest tip of the proverbial iceberg. We’ll continue to build our toolkit in future articles, and maybe end up with something that isn’t arguably worse than a standard server environment, just different.
