PG Phriday: Postgres on ZFS

December 29, 2017

ZFS is a filesystem originally created by Sun Microsystems, and it has been available on BSD for over a decade. While Postgres will run just fine on BSD, most Postgres installations have historically been Linux-based systems. ZFS on Linux has had a much rockier road to integration due to perceived license incompatibilities.

As a consequence, administrators were reluctant or outright refused to run ZFS on their Linux clusters. It wasn’t until OpenZFS was introduced in 2013 that this slowly began to change. These days, ZFS and Linux are starting to become more integrated, and Canonical of Ubuntu fame even announced direct support for ZFS in their 16.04 LTS release.

So how can a relatively obscure filesystem designed by a now-defunct hardware and software company help Postgres? Let’s find out!

Eddie waited til he finished high school

Old server hardware is dirt cheap these days, and makes for a perfect lab for testing dubious configurations. For those following along at home, or who just want a point of reference, this is the server we’ll be using for these tests:

  • Dell R710
  • x2 Intel X5660 CPUs, for up to 24 threads
  • 64GB RAM
  • x4 1TB 7200RPM SATA HDDs
  • H200 RAID card configured for Host Bus Adapter (HBA) mode
  • 250GB Samsung 850 EVO SSD

The H200 is particularly important, as ZFS acts as its own RAID system. It also has its own checksumming and other algorithms that don’t like RAID cards getting in the way. As such, we put the card itself in a mode that facilitates this use case.

Due to that, we lose out on any battery-backed write cache the RAID card might offer. To make up for it, it’s fairly common to use an SSD or other fast persistent storage to act as both a write cache and a read cache. This effectively turns our HDDs into hybrid storage, which is a huge performance boost on a budget.

She had a guitar and she taught him some chords

First things first: we need a filesystem. This hardware has four 1TB HDDs, and a 250GB SSD. To keep this article from being too long, we’ve already placed GPT partition tables on all the HDDs, and split the SSD into 50GB for the OS, 32GB for the write cache, and 150GB for the read cache. A more robust setup would probably use separate SSDs or a mirrored pair for these, but labs are fair game.
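For reference, a layout like that can be produced with something along these lines. This is only a sketch; the device names, partition numbers, and sizes are assumptions chosen to match the zpool command below, so adapt them to your own hardware:

# One full-size ZFS partition per HDD (GPT type bf01 = Solaris/ZFS):
for disk in sdb sdc sdd sde; do
    sgdisk --zap-all /dev/$disk
    sgdisk -n 1:0:0 -t 1:bf01 /dev/$disk
done

# The SSD already hosts the OS; add a 32GB write cache (SLOG)
# and a 150GB read cache (L2ARC) partition:
sgdisk -n 4:0:+32G -t 4:bf01 /dev/sda
sgdisk -n 5:0:+150G -t 5:bf01 /dev/sda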

We begin by creating the ZFS pool itself and showing the status so we know it worked:

$> zpool create tank -o ashift=12 \
    mirror sdb1 sdd1 \
    mirror sdc1 sde1 \
       log sda4 \
     cache sda5

$> zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sun Dec 10 00:24:02 2017
config:

    NAME                            STATE     READ WRITE CKSUM
    tank                            ONLINE       0     0     0
      mirror-0                      ONLINE       0     0     0
        sdb1                        ONLINE       0     0     0
        sdd1                        ONLINE       0     0     0
      mirror-1                      ONLINE       0     0     0
        sdc1                        ONLINE       0     0     0
        sde1                        ONLINE       0     0     0
    logs
      sda4                          ONLINE       0     0     0
    cache
      sda5                          ONLINE       0     0     0

errors: No known data errors

That first zpool command created the equivalent of a 4-disk RAID-10 with separate read and write cache. Besides being disturbingly easy, it also mounted the filesystem automatically. We could literally start storing data on it immediately, but let’s take some time to tweak it first.

Compression isn’t always enabled by default, and it’s common to disable atime so every read doesn’t incur a corresponding disk write. We should probably also create a separate dataset for the database storage to keep things tidy.

zfs set compression=lz4 tank
zfs set atime=off tank
zfs set relatime=on tank

zfs create tank/db

This test is also taking place on a Debian system, so we have access to several cluster creation and configuration utilities. Let’s create a dedicated ZFS instance and set some common configuration parameters:

pg_createcluster 10 zfs -D /tank/db
systemctl daemon-reload

pg_conftool 10 zfs set shared_buffers 4GB
pg_conftool 10 zfs set work_mem 12MB
pg_conftool 10 zfs set maintenance_work_mem 1GB
pg_conftool 10 zfs set random_page_cost 2.0
pg_conftool 10 zfs set effective_cache_size 40GB

systemctl start postgresql@10-zfs

createdb pgbench
pgbench -i -s 100 pgbench

At the end we also created a pgbench database and initialized it with 10M records for testing purposes. Let’s get started!

They moved into a place they both could afford

Let’s start by getting a performance baseline for the hardware. We might expect peak performance at 12 or 24 threads because the server has 12 real CPUs and 24 threads, but query throughput actually topped out at 32 concurrent processes. We can scratch our heads over this later; for now, we can consider it the maximum capability of this hardware.

Here’s a small sample:

$> pgbench -S -j 32 -c 32 -M prepared -T 20 pgbench

...
tps = 264661.135288 (including connections establishing)
tps = 264849.345595 (excluding connections establishing)

So far, this is pretty standard behavior. 260k prepared queries per second is great read performance, but this is supposed to be a filesystem demonstration. Let’s get ZFS involved.

The papers said Ed always played from the heart

Let’s repeat that same test with writes enabled. Once that happens, filesystem syncs, dirty pages, WAL overhead, and other things should drastically reduce overall throughput. That’s an expected result, but how much are we looking at, here?

$> pgbench -j 32 -c 32 -M prepared -T 10 pgbench

...
tps = 6153.877658 (including connections establishing)
tps = 6162.392166 (excluding connections establishing)

SSD cache or not, storage overhead is a painful reality. Still, 6000 TPS with writes enabled is a great result for this hardware. Or is it? Can we actually do better?

Consider the Postgres full_page_writes parameter. Tomas Vondra has written about it in the past as a necessity to prevent data corruption from partially written (torn) pages. The WAL underpins both streaming replication and crash recovery, so its integrity is of utmost importance. As a result, this is one parameter almost everyone should leave alone.

ZFS is Copy on Write (CoW). As a result, it isn’t possible to end up with a torn page, because a page can’t be partially written without reverting to the previous copy. That means we can safely turn off full_page_writes in the Postgres config.
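Something like the following should do it, reusing the pg_conftool approach from earlier. This is just a sketch; full_page_writes only requires a reload, not a restart:

pg_conftool 10 zfs set full_page_writes off
systemctl reload postgresql@10-zfs

The results are some fairly startling performance gains: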

$> pgbench -j 32 -c 32 -M prepared -T 10 pgbench

...
tps = 10325.200812 (including connections establishing)
tps = 10336.807218 (excluding connections establishing)

That’s nearly a 70% improvement. Due to write amplification caused by full page writes, Postgres produced 1.2GB of WAL files during a 1-minute pgbench test, but only 160MB with full page writes disabled.
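For anyone who wants to reproduce that measurement, one rough approach (a sketch, not part of the original test) is to compare WAL positions before and after a timed run with Postgres 10’s pg_current_wal_lsn():

START=$(psql -Atc "SELECT pg_current_wal_lsn()" pgbench)
pgbench -j 32 -c 32 -M prepared -T 60 pgbench
STOP=$(psql -Atc "SELECT pg_current_wal_lsn()" pgbench)

psql -Atc "SELECT pg_size_pretty(pg_wal_lsn_diff('$STOP', '$START'))" pgbench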

To be fair, a 32-thread pgbench write test is extremely abusive and certainly not a typical usage scenario. However, ZFS just ensured our storage sees a much lower write load, all from altering a single parameter. That also extends the capabilities of the hardware to higher write workloads, since IO bandwidth is no longer being consumed by all that extra WAL traffic.

They both met movie stars, partied and mingled

Astute readers may have noticed we didn’t change the default ZFS record size of 128kb to align with the Postgres page size of 8kb.
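Making that change is a one-liner, sketched below. Keep in mind that recordsize only affects newly written data, so the pgbench data has to be reloaded before the comparison means anything:

zfs set recordsize=8k tank/db

This is what happens when we do that: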

$> zfs get compressratio tank/db

NAME     PROPERTY       VALUE  SOURCE
tank/db  compressratio  1.71x  -

Now let’s compare that to the ZFS default of 128kb:

$> zfs get compressratio tank/db

NAME     PROPERTY       VALUE  SOURCE
tank/db  compressratio  7.59x  -

As it turns out, the larger 128kb records give ZFS sixteen of those 8kb Postgres pages to compress at once, which saves far more space. That will allow our measly 2TB to go a lot further than is otherwise possible.

Please note that this is not de-duplication, but simple lz4 compression, which is nearly real-time in terms of CPU overhead. De-duplication on ZFS is currently an uncertain bizarro universe populated with misshapen horrors crawling along a broken landscape. It’s a world of extreme memory overhead for de-duplication tables, and potential lost data due to inherent conflicts with the CoW underpinnings. Please don’t use it, don’t let anyone else use it, or even think about using it, ever.

They made a record and it went in the chart

We’re still not done. One important aspect of ZFS as a CoW filesystem is that it has integrated snapshots.

Consider the scenario where a dev is connected to the wrong system and drops what they think is a table in a QA environment. It turns out they were in the wrong terminal and just erased a critical production table, and now everyone is frantic.

We can simulate that by dropping the table ourselves. For the purposes of this demonstration, we’ll take a snapshot before doing so:

zfs snapshot tank/db@saveme
psql pgbench -c "drop table pgbench_accounts"

Agh! Luckily, ZFS has a clone ability that makes full use of its CoW nature. While ZFS snapshots are always available in a read-only manner, a clone is a fully writable fork of that snapshot. This means we can make a few slight changes and bring the clone online with the same data as when we took the snapshot.

Namely, we may need to copy and modify configuration files and remove any lingering postmaster.* files that may conflict with the running instance. Then we can just start the clone on a different port. At that point, there are two Postgres instances, and we can dump the missing pgbench_accounts table from the clone and import it back into the main database without skipping a beat.

Here’s how that would work:

zfs clone tank/db@saveme tank/recover

cd /tank/recover
rm postmaster.*

cp /usr/share/postgresql/10/postgresql.conf.sample postgresql.conf
cp /etc/postgresql/10/zfs/pg_hba.conf pg_hba.conf

pg_ctl -D /tank/recover -o "-p 5433" start

pg_dump -p 5433 -t pgbench_accounts pgbench | psql pgbench

With the table restored, everyone was happy, and the infrastructure team put in new firewall rules that prevented development systems from connecting to prod. All learned a valuable lesson and moved on with life.

Afterwards, we just need to clean up by stopping the clone Postgres instance and erasing the associated mount:

pg_ctl -D /tank/recover stop -m immediate
zfs destroy tank/recover

Many ZFS systems have automated snapshot creation and cleanup maintenance jobs running most of the time. This means we may already have an hourly, daily, or weekly snapshot available for recovery or investigation purposes.
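A minimal hand-rolled version might be a cron entry like the one below. This is purely illustrative; dedicated tools such as zfs-auto-snapshot or sanoid handle naming, retention, and cleanup far better:

# /etc/cron.d/zfs-db-snapshot (hypothetical)
0 * * * *  root  /sbin/zfs snapshot tank/db@hourly-$(date +\%Y\%m\%d\%H)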

While the Linux Volume Manager (LVM) system has similar functionality, LVM snapshots are limited to non-allocated space in the volume group, and must be mounted manually. ZFS snapshots come from the same storage pool, so we don’t need to guess at how large future snapshots may need to be. We don’t lose that potential allocation space from our volume. Clones are also mounted automatically in a way that shows their relation to the source.

ZFS makes it extremely easy to leverage snapshots. In fact, a development system could even designate a single Postgres instance as a snapshot source and provide dozens of developers with their own online copy of the database. It is, after all, the same approach used for VMs running on ZFS.

ZFS also has a command that can send a snapshot to a remote ZFS filesystem. If a previous snapshot already exists there, it only sends the differences. This built-in differential is actually much faster than even rsync, as it already knows exactly which blocks are different. Consider what this means for VLDB Postgres systems that are several TB in size.
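The send/receive pattern looks roughly like this; the standby host and snapshot names are made up for illustration:

# The first pass ships the entire dataset:
zfs send tank/db@sunday | ssh standby zfs receive tank/db

# Later passes only ship the blocks that changed between snapshots:
zfs send -i tank/db@sunday tank/db@monday | ssh standby zfs receive tank/db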

The future was wide open

It’s difficult to discount an immediately observable reduction in write overhead. Snapshots have a multitude of accepted and potential use cases, as well. In addition to online low-overhead compression, and the hybrid cache layer, ZFS boasts a plethora of features we didn’t explore.

Built-in checksums with integrated self-healing suggest it isn’t entirely necessary to re-initialize an existing Postgres instance to enable checksums. The filesystem itself ensures checksums are validated and correct, especially if we have more than one drive resource in our pool. It even goes the extra mile and actively corrects inconsistencies when encountered.
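Checksums are verified on every read, but we can also request a full verification pass explicitly. A quick sketch:

zpool scrub tank
zpool status tank    # reports repaired bytes and any checksum errors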

I immediately discounted ZFS back in 2012 because the company I worked for at the time was a pure Linux shop. ZFS was only available using the FUSE driver back then, meaning ZFS only worked through userspace with no real kernel integration. It was fun to tinker with, but nobody sane would use that on a production server of any description.

Things have changed quite drastically since then. I’ve stopped waiting for btrfs to become viable, and ZFS has probably taken the throne away from XFS as my filesystem of choice. Future editions of the Postgres High Availability Cookbook will reflect this as well.

Postgres MVCC and ZFS CoW seem made for each other. I’m curious to see what will transpire over the next few years now that ZFS has reached mainstream acceptance in at least one major Linux distribution.
