One of the guiding Postgres design principles is heavy reliance on features provided by the environment (particularly the operating system), and file systems are a prime example of this. Unlike some other databases, Postgres has never supported raw devices, i.e. the ability to store data on block devices without creating a regular file system first. That would essentially mean implementing a “custom” file system, which might be tailored to the database needs (and thus faster), but would require a significant investment of development time (e.g. to support different platforms, adapt to ever-evolving hardware, and so on).
As development time is an extremely precious resource (doubly so for a small development team in the early stages of an open-source project), the rational choice was to expect the operating system to provide a sufficiently good general-purpose file system, and to focus on implementing database-specific features with high added value for the user.
So you can run Postgres on many file systems, which raises a couple of questions: Are there any significant differences between them? Does it even matter which file system you pick? Exploring these questions is the aim of this analysis.
I’ll look at a couple of common file systems on Linux, both traditional (ext4/xfs) and modern (zfs/btrfs), run an OLTP benchmark (pgbench) on SSD devices in different configurations, and present the results along with a basic analysis.
Note: For OLTP, it’s not really practical (or cost-effective) to use traditional disks. The results might be different and perhaps interesting, but ultimately useless.
Setup
I’ve used my two “usual” machines with different hardware configurations, a small one with SATA SSD and a bigger one with NVMe SSD.
i5
- Intel i5-2500K (4 cores)
- 8GB RAM
- 6x Intel DC S3700 100GB (SATA SSD)
- PG: shared_buffers = 1GB, checkpoint_timeout = 15m, max_wal_size = 64GB
- PG tuning for ZFS runs: full_page_writes = off, wal_init_zero = off, wal_recycle = off
xeon
- 2x Intel e5-2620v3 (16 cores / 32 threads)
- 64GB RAM
- 1x WD Gold SSD 960GB (NVMe)
- PG: shared_buffers = 8GB, checkpoint_timeout = 15m, max_wal_size = 128GB
- PG tuning for ZFS runs: full_page_writes = off, wal_init_zero = off, wal_recycle = off
Both machines are running kernel 5.17.11 and zfs 2.1.4 or 2.1.5.
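For reference, the ZFS-specific Postgres settings listed above could be applied along these lines (a minimal sketch; the data directory path is a placeholder and the actual setup is scripted):

```
# apply the Postgres settings used for the ZFS runs (sketch, assumes a running cluster)
# full_page_writes can be disabled because ZFS copy-on-write never tears 8kB pages
psql -c "ALTER SYSTEM SET full_page_writes = off"
# WAL files don't need to be zero-filled or recycled on a copy-on-write file system
psql -c "ALTER SYSTEM SET wal_init_zero = off"
psql -c "ALTER SYSTEM SET wal_recycle = off"
# shared_buffers / checkpoint settings as listed above (values for the xeon machine)
psql -c "ALTER SYSTEM SET shared_buffers = '8GB'"
psql -c "ALTER SYSTEM SET checkpoint_timeout = '15min'"
psql -c "ALTER SYSTEM SET max_wal_size = '128GB'"
# shared_buffers requires a restart, the other settings only a reload
pg_ctl -D /path/to/data restart
```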
On the i5 machine we can test different RAID configurations too. Both zfs and btrfs allow using multiple devices directly; for ext4/xfs we can use mdraid. The names (and implementations) of the RAID levels differ a bit, so here’s a rough mapping:
Table: rough mapping of equivalent RAID levels between mdraid (ext4/xfs), zfs and btrfs, with the number of “data bearing” disks for each level
Note: The number in the first column is the number of “data bearing” disks. For example, with striping all N disks are used to store data. With raid5, one of the drives stores parity information, so we only get (N-1) data disks. With raid10, we keep two copies of each piece of data, so we get N/2 of the capacity.
Note: btrfs defines raid1 a bit differently from mdraid, as it means “2 copies” and not “N copies”, which makes it more like raid10 than mirroring. Keep this in mind when interpreting the results/charts.
Note: btrfs implements raid5/6, but this support is considered experimental. It’s been like this for years, and given that recent development seems aimed more at discouraging people from using btrfs RAID5/6 than at fixing the underlying issues, I’m not holding my breath. Still, it’s available, so let’s test it and keep this caveat in mind.
For a more detailed description of the various RAID levels, see here.
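To make this a bit more concrete, here is a rough sketch of how the striped (RAID0) variant might be set up for each file system; the device names, mount points and mount options are illustrative, and the actual setup lives in the scripts mentioned below:

```
# mdraid + ext4/xfs: striped array across the 6 SATA SSDs (illustrative device names)
mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[b-g]
mkfs.xfs /dev/md0                      # or mkfs.ext4 /dev/md0
mount -o noatime /dev/md0 /mnt/pgdata

# zfs: striped pool (no redundancy) across the same devices
zpool create pgpool /dev/sd[b-g]

# btrfs: data and metadata striped across all devices
mkfs.btrfs -d raid0 -m raid0 /dev/sd[b-g]
mount -o noatime /dev/sdb /mnt/pgdata
```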
Some basic file system tuning was performed to ensure proper alignment, stripe/stride and mount options. For ZFS, the tuning is a bit more extensive and sets this for the whole pool:
- recordsize=8K
- compression=lz4
- atime=off
- relatime=on
- logbias=latency
- redundant_metadata=most
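For illustration, applying these to a pool might look like this (the pool name is a placeholder):

```
# match the ZFS record size to the 8kB Postgres page size
zfs set recordsize=8K pgpool
# cheap compression, usually a net win for database data
zfs set compression=lz4 pgpool
# avoid useless atime updates on data files
zfs set atime=off pgpool
zfs set relatime=on pgpool
# optimize the ZIL for latency rather than throughput
zfs set logbias=latency pgpool
# keep only the most important redundant metadata copies
zfs set redundant_metadata=most pgpool
```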
All the scripts (and results) from this benchmark are available on my GitHub. This includes setup of the RAID devices/pools, mounting, etc.
Note: If you have suggestions for other mount options (or any other optimization ideas), let me know. I picked the optimizations I think matter the most, but maybe I’m wrong and some of those options would make a big difference.
i5 (6 x 100GB SATA SSD)
Let’s look at the smaller machine first, comparing the file systems on multiple devices in various RAID configurations. These are simple “total throughput” numbers, from sufficiently long pgbench runs (read-only: 15 minutes, read-write: 60 minutes) for different scales.
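For reference, the runs boil down to plain pgbench invocations along these lines (client/job counts are illustrative; the exact values are in the scripts on GitHub):

```
# initialize the data set (scale 250 is ~4GB, scale 5000 is ~75GB)
pgbench -i -s 250 bench

# read-only run: 15 minutes, built-in SELECT-only script
pgbench -S -c 16 -j 4 -T 900 -P 1 bench

# read-write run: 60 minutes, default TPC-B-like script
pgbench -c 16 -j 4 -T 3600 -P 1 bench
```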
read-only
For read-only runs, the results are pretty even, both when the data fits into RAM (scale 250 is ~4GB) and when it gets much larger (scale 5000 is ~75GB).
The main difference seems to be that zfs is consistently slower. I don’t know exactly why, but considering this affects even the case when everything fits into RAM (or ARC), my guess is that ZFS simply uses more CPU than the “native” Linux file systems. This machine only has 4 cores, so this may matter.
Table: Results for i5/scale 250/read-only
Table: Results for i5/scale 5000/read-only
read-write
For read-write tests, the results are much less even, for both data set sizes.
There are a couple of interesting observations. Firstly, ext4/xfs clearly win, at least when it comes to total throughput (we’ll look at other aspects in a minute). Secondly, for the small data set ZFS is almost as fast as ext4/xfs, which is great news: this covers a lot of practical use cases, because even if you have a lot of data, chances are you actively access only a small subset of it.
The unfortunate observation is that btrfs clearly underperforms, for some reason. This is particularly visible for the smaller data set, where btrfs tops out at ~5000 tps while the other file systems achieve 2-3x higher throughput.
Note: This is the place to remember that btrfs defines raid1 differently, which means the “good” result for the larger data set is not that impressive: instead of 6 copies, btrfs keeps only 2 (and thus has 3x more data disks than the other file systems).
Table: Results for i5/scale 250/read-write
Table: Results for i5/scale 5000/read-write
xeon (1x NVMe SSD)
Now, let’s look at the “bigger” machine, which however has only a single NVMe SSD. That means we can’t test various RAID levels, so the results fit into two simple charts.
Table: read-only results
Table: read-write results
For the read-only tests, the results are (again) pretty even: ZFS is a bit slower than the other file systems, particularly at the larger scales, similar to what we saw on the smaller machine.
For read-write tests, the differences are much more dramatic. EXT4/XFS are the clear winners. ZFS keeps pace on the two smaller data sets (which fit into shared buffers / RAM), but at scale 10000 it drops to only ~6k tps (compared to ~30k tps for ext4/xfs). BTRFS performs quite poorly: on the small/medium data sets it’s somewhat competitive with ZFS, but other than that it clearly underperforms.
Throughput over time
There’s one important thing the throughput results presented so far ignore—stability of the throughput, i.e. how it evolves over time. Imagine two systems running the same benchmark:
- system A does 15k tps consistently, with minimal variation during the benchmark run
- system B does 30k tps in the first half, and then gets stuck and does 0 tps
Both systems do 15k tps on average, so they would look exactly the same in the total-throughput charts presented so far. But I assume we’d agree that system A has much more consistent and predictable performance.
So let’s look not just at the total throughput, but also at how throughput develops over time. The charts in this section show throughput for each second during a 60-minute read-write benchmark (blue), with a running average over 15-second intervals (red).
This should give you some idea of how much the throughput fluctuates, and also reveal trends/patterns (degradation over time, impact of checkpoints, …).
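For the record, the per-second numbers come from the pgbench progress reporting (-P 1); something along these lines can turn that output into the per-second series plus a 15-sample running average (a sketch, file names are made up):

```
# pgbench -P 1 writes per-second progress lines to stderr; assume they were
# captured in pgbench.log, e.g. "progress: 60.0 s, 10139.4 tps, lat 1.6 ms stddev 0.4"
grep '^progress:' pgbench.log | awk '{print $2, $4}' > tps.txt

# add a running average over the last 15 one-second samples as a third column
awk '{ buf[(NR - 1) % 15] = $2
       n = (NR < 15) ? NR : 15
       sum = 0
       for (i = 0; i < n; i++) sum += buf[i]
       print $1, $2, sum / n }' tps.txt > tps-smoothed.txt
```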
i5 / scale 1000
All results in this section are from the RAID0/striping setup. For scale 1000 (exceeds RAM, but small compared to max_wal_size), we get this:
Charts: throughput over time, i5 / scale 1000 / read-write
It’s immediately obvious that ZFS has exceptionally clean and consistent behavior: with the default config there’s some checkpoint impact, but disabling full-page writes makes that go away. As a DBA, this is what you want to see on your systems: minimal variation (jitter) over the whole benchmark run.
EXT4/XFS achieve higher throughput (~7.5k tps vs. ~6.5k tps, roughly 15% more), but the jitter is clearly much higher; the per-second throughput varies roughly between 5k and 9k tps. Not great, not terrible. There’s also a clear checkpoint impact, and unlike on ZFS we can’t eliminate it by disabling full-page writes (ext4/xfs don’t guarantee atomic 8kB page writes, so FPW has to stay on); we can only make checkpoints less frequent by increasing the timeout / WAL size.
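For completeness, spacing checkpoints further apart boils down to a couple of settings; the values below are purely illustrative, not what the benchmark used:

```
# fewer checkpoints => fewer full-page images per unit of time,
# at the cost of longer crash recovery
psql -c "ALTER SYSTEM SET checkpoint_timeout = '30min'"
psql -c "ALTER SYSTEM SET max_wal_size = '256GB'"
# spread the checkpoint writes over most of the interval to reduce I/O spikes
psql -c "ALTER SYSTEM SET checkpoint_completion_target = 0.9"
psql -c "SELECT pg_reload_conf()"
```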
BTRFS has more issues, though: the throughput is lower than with ZFS, while the jitter is much worse than with EXT4/XFS. On average (the red line) it’s not that bad, but it regularly drops close to 0, which implies high latency for some transactions.
i5 / scale 5000
Now let’s look at the “large” data set (much larger than RAM).
Charts: throughput over time, i5 / scale 5000 / read-write
The first observation is that the impact of checkpoints is much less visible, both for ZFS with the default config and for EXT4/XFS. This is a consequence of the data set size and the random access pattern: pages keep being touched for the first time since the last checkpoint, so the rate of full-page images written to WAL never drops before we hit the WAL limit (and the next checkpoint starts).
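One way to observe this directly is to watch the WAL and full-page-image counters between checkpoints; a couple of queries along these lines (pg_stat_wal requires Postgres 14 or newer; on the newest releases the checkpoint counters have moved from pg_stat_bgwriter to pg_stat_checkpointer):

```
# total WAL records, full-page images and WAL volume generated so far
psql -c "SELECT wal_records, wal_fpi, pg_size_pretty(wal_bytes) FROM pg_stat_wal"
# were checkpoints triggered by the timeout or by hitting max_wal_size?
psql -c "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter"
```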
For BTRFS, the situation is much worse than on the medium (scale 1000) data set: not only is the jitter even worse, the throughput also gradually drops over time. It starts close to 4000 tps, but by the end it’s down to only about 2000 tps.
xeon / scale 1000
Now let’s look at throughput on the larger machine, with a single NVMe device. On the medium scale (larger than shared buffers, fits into RAM) it looks like this:
Charts: throughput over time, xeon / scale 1000 / read-write
On ZFS with default configuration, the checkpoint pattern is clearly visible. With the tuned configuration, the pattern disappears and the behavior gets much more consistent. Not as smooth as on the smaller machine with multiple devices, but better than the other file systems.
There’s very little difference between EXT4 and XFS, both in total throughput and in behavior over time. The total throughput is better than with ZFS (~60k vs. ~40k tps), but the jitter is more severe.
For BTRFS, the overall throughput is fairly low (~30k tps), while the jitter is in some ways better and in some ways worse than for EXT4/XFS: the per-second throughput mostly stays closer to the average, but it also regularly drops to 0.
xeon / scale 10000
On the large (~150GB) data set, the results look like this:
Charts: throughput over time, xeon / scale 10000 / read-write
This is similar to what we saw on the smaller machine: the checkpoint pattern disappears, due to the amount of full-page images we have to write to WAL.
On ZFS with the tuned configuration, the jitter increases significantly. I’m not sure why, but I’d say it’s still fairly consistent (compared to the other file systems).
For EXT4/XFS, the throughput drops to ~50% (compared to the medium scale) while the jitter remains about the same—I’d argue this is a pretty good result.
With BTRFS we see the same unfortunate behavior as on the smaller machine: with the large data set, the throughput gradually decreases, with significant jitter.
Conclusions
So, what have we learned? The first observation is that there are significant differences between the evaluated file systems—it’s hard to pick a clear winner, though.
The traditional file systems (EXT4/XFS) perform very well in OLTP workloads, at least in terms of total throughput. ZFS performs pretty well too: it’s a bit slower in terms of throughput, but its behavior is very consistent, with minimal jitter (particularly with full-page writes disabled). It’s also quite flexible; for example, it allows moving the ZIL to a separate device (see the sketch below), which should improve throughput.
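For illustration, adding a separate log device to hold the ZIL is a single command; the pool and device names here are just placeholders:

```
# add a fast device as a dedicated intent log (SLOG) for the pool
zpool add pgpool log /dev/nvme1n1
# optionally mirror the log device to avoid a single point of failure
# zpool add pgpool log mirror /dev/nvme1n1 /dev/nvme2n1
```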
As for BTRFS, the results are not great. I did a similar OLTP benchmark a couple of years ago, and BTRFS actually performed a bit better this time. Still, the overall consensus seems to be that BTRFS is not particularly well suited for databases, and others have observed this too. That’s a bit unfortunate, as some of its features (higher resilience, easy snapshotting) would be very useful for databases.
The last thing to keep in mind when reading the results is that these are stress tests, fully saturating the I/O subsystem. That’s not what happens on most production systems: once the I/O gets saturated for extended periods of time, you’d already be thinking about upgrading the system to reduce the storage load. With that extra headroom, latencies (and jitter) should look better than in these benchmarks.