A Complete Guide to PostgreSQL Backup & Recovery

October 06, 2021

Backups are often taken based on the suggestions and recommendations of the person we believe knows best. We don’t try to understand what situations led to those recommendations and how they might be different from what we’re facing. We need to list the situations we might need to recover from and then discuss the available solutions. Every environment has its own challenges; even if we have multiple database clusters in the same organizations, the backup requirements for may vary. In short, planning is key to backup and recovery.

 

PostgreSQL backup “WHY”

The reason why we have backups or plan for backups is to prevent data loss and ensure continuity. However, we never study our own environment, which is unique for everyone, for how a data loss can happen. There can be many reasons for data loss, but here are a few that we all may have encountered in our careers:

  • Data Corruption: The most important and critical one. Even if you have multiple replicas available it is very likely that the corruption can spread (much quicker than you think) across the replicas making them redundant for running any kind of production. 
  • Device Failure: This is a very common situation. This can lead to unavailability of the whole datacentre which leads into all running instances being unavailable, and would require the cluster to be rebuilt from a backup. 
  • Human Error: This one spreads quickly across replicas. Try running a drop table command and then stop it from replicating. Despite numerous precautions, there can always be a slip that can put you in a tight spot. 
  • Software Errors: There will always be a bug here or there at any level of the software being engaged. It could be an operating system issue or even a network transfer issue that can lead to missing bits or bytes, then unexpected results. 
  • Compliance requirements and Retention Policies: Depending on your industry type, you may need to retain or store data for a certain period of time in order to stay compliant. You might be required to keep backup copies of your data in order to ensure that any changes can be tracked/ reverted over that period.

 

PostgreSQL backup options

PostgreSQL provides several options for backups in necessary formats and automation. It is the most powerful and trusted open source database and in terms of backup options, we can use the built-in tool or operating system level options. There are also block level backups we can take advantage of using some very reliable tools on the market. 

Built-in Options:

The built-in options are primarily based on the two applications bundled with PostgreSQL

  1. pgdump / pgdumpall
  2. pg_basebackup

External Options:

Due to the simplicity of the way PostgreSQL software is integrated with the use of operating system dependencies, it becomes easy to use PostgreSQL with any of your organization’s tools at a file system level. 

  1. Physical backup of the data directory
  2. Block level backup of the volume on which data directory resides
  3. Barman and pgBackrest

The only challenge with the file system and block level backups is that the database either needs to be taken offline or needs to be put into snapshot mode using the pg_start_backup() function to ensure that the backup is consistent and recoverable to the point when the backup starts. 

 

How do I back up and restore a PostgreSQL database?

The backup and restore operations using the PostgreSQL are well documented for the in-built options of PostgreSQL. Follow these steps to make a base backup from PostgreSQL options:

pg_dump

pg_dump can generate an sql file that can be fed back to the server using any SQL client that can connect to the PostgreSQL database. This gives you the option to load it in a destination which can be on another server and a different version—this would still require compatible data types and a schema. This is helpful in manipulating the content as the file is human readable for any test environments you might have. It doesn’t require any special permissions as this runs as an SQL application and needs read-only access to the database that you need to backup. What also helps is the fact that you can make backups of portions of the database that you can access. In order to backup the entire database, you will still need superuser access. 
Here is a sample of how you can execute pgdump on the same host as the database itself:

pg_dump dbname > dumpfilename

Since this is an sql client, it can also use the sql strings to connect to a remote database in order to take the backups of parts of, or the entire  database to which you have access. Here is an example of the remote command:
 

For Schema

pg_dump -u username -h hostname -p port -n schemaname

 

For Database (backup one database)

pg_dump -u username -h hostname -p port -d databasename

 

For Table

pg_dump -u username -h hostname -p port -t tablename

Since pg_dump is using a COPY command at the backend, it results in additional I/O which can cause performance issues. Not using the PostgreSQL cache or the file system cache also results in a performance hit. The dump is affected by the options we use as well. We do have compression options available but it's both CPU and I/O intensive both while making a dump and while restoring the dump. When compression is used, it’s a binary dump instead of SQL dump.
 

pg_dumpall

Using pg_dumpall you can dump an entire PostgreSQL cluster in a single command. You can dump it in a file which is human readable.

$ pg_dumpall -U postgres -h localhost -p 5432 --clean --file=cluster.sql

For more information about compression and custom format please click here.

Restoring the backup from pg_dump backup (pg_restore)

pg_restore is the tool we use to restore if we have compressed or saved the output in a custom format. Like the behaviour with pg_dump, pg_restore creates a connection with the database and then restores the full backup (depending on the backup itself). It does allow some flexibility in restoring tables and schemas in order to extract certain data when required to recover from an unwanted situation. 

Here is a sample command for restoring a custom format dump using pg_restore

pg_restore -C -d databasename db.dump

Here are some of the important options when using a pg_dump which can be used to dump or export a PostgreSQL database

  • a it takes a data only dump 
  • s it takes schema only dump
  • t this takes the table level dump. The usage would be 
pg_dump -d <database_name> -t <table_name> > tabledump.sql

This is helpful in recovering tables.

pg_basebackup

This is the tool which most of the PostgreSQL DBA’s would end up using on a daily basis be it via scripts or manually. This creates the base backup of the database which can help in recovering in multiple situations. This takes an online backup of the database and hence is very useful when being used in production. It also makes a backup of the entire cluster so you do not have to worry about missing anything. The basic syntax used is as follows: 

pg_basebackup -h serverip -D targetdirectory

There are several options available with pg_basebackup as well and can be checked by clicking this link.

The most useful options are:

  • F this gives you an option to save it as is or to tar in order to compress the file size
  • R if you are using this to create a standby, this will automatically add a recovery.conf to the backup folder
  • X will include the wall files generated while the backup was being taken, this will ensure that your backup is up-to-date

Additionally, pg_basebackup is backwards compatible, which translates to the fact that you can use a newer version of the utility to take the backups of all the servers in your environment.

 

Point in time recovery (PITR) for PostgreSQL

Another iteration of the above which can be used for point in time recovery (PITR) is setting the parameter recovery_target_time to a time before the point at which we think the issue occurred. Taking the example at the beginning of this article, if we had all the WAL archives since the base backup, we could have gone ahead and set the time to a known time before the table was dropped and we could have recovered most of the data in the table.
 
It’s worth mentioning here that now that we understand that WAL files are the brick and mortar file recovery is built upon, it always helps to keep the pg_xlog/pg_wal folder with all the WAL files unremoved to recover disk space. The server ensures that any WAL files that are not required are cleaned up at routine intervals.
 
Here are the steps needed to do a PITR testing in a test environment. To get a better understanding of this process you can try the following steps:

1. Take a base backup of the cluster

-bash-4.2$ pg_basebackup -D /tmp/backup/ -Ft -p 5444 -P -U enterprisedb
190646/190646 kB (100%), 1/1 tablespace

2. Create a new data directory and copy a backup from  /tmp/backup/ to /var/lib/edb/as11/data/

3. Remove any wal files from wal directory 

4. Create a recovery.conf

-bash-4.2$ pg_basebackup -D /tmp/backup/ -Ft -p 5444 -P -U enterprisedb
190646/190646 kB (100%), 1/1 tablespace

5. Start the database

-bash-4.2$ ./pg_ctl -D /var/lib/edb/as11/data/ start

This will recover the database till the target time stamp

For example:

If I want to recover until the second insert then I will check the time after the second insert by using select now(); and then I will set that time in the recovery.conf file for the recovery_target_time.

edb=# insert into emp values(1,'a');
INSERT 0 1
 
edb=# insert into emp values(2,'b');
INSERT 0 1
 
edb=# select now();
 
               now               
---------------------------------
 26-NOV-19 22:45:55.36366 +05:30
(1 row)
 
recovery_target_time = '26-NOV-19 22:45:55'

 

Backups and the buts that can help plan

Now it comes down to how and what kind of backups are available and where they can help us. Our options with PostgreSQL are below:

Logical Backups

  • pg_dump
  • pg_dumpall 

A Logical dump is nothing more than a SQL dump of the database and its objects.

Physical Backups

  • File system snapshots
  • pg_basebackup
  • Manual base backup

Backup time will depend on the method used to make backups and the database size. You should also take time to give some consideration to your space requirements for your backups. If you are making physical backups, the space requirement may be much higher compared to a logical backup. However, you should plan your backup strategy based on the RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Let's explore some recovery scenarios (Recovering from a table to a complete cluster) and recovery options.

Let's look at individual recovery objects and choose the fastest backup method:

  1. Recovering a single table: The fastest way to recover a table is from the SQL dump created using pg_dump or pg_dumpall. Since it’s a plain-text file (can also be extracted from a custom format created by pg_dump -Fc) you can search and restore the table definition and its data in just two steps. This gives you flexibility to go ahead and search for certain records and restore them.
  2. Recovering a single database (or Importing a PostgreSQL Database dump): Again, ignoring the size factor for the moment, the fastest way would still be pg_dump and pg_dumpall. Here we might need to take into consideration that the sql dump does not contain the index data (obviously), and rebuilding all the indexes could be very time consuming. 

Since the SQL files run a COPY command at the backend, query tuning techniques can be applied to maximize the performance. The same can be tested and documented in a test environment to get to an optimal configuration in case a restore is required in future. 
Recovering an entire cluster: A cluster might consist of multiple databases as well as multiple extensions, so it becomes very important that everything be restored to the closest state possible. Here the best way is to use the snapshots using either File System Level backups or pg_basebackup. This ensures that you do not have to do any manual work and you just need to point to the data directory (after correcting permissions) and you can start your cluster.

 

Barman (Backup and recovery manager)

Barman is an open source backup management tool. It relies on the internal capabilities of PostgreSQL and allows DBAs to remotely manage and recover multiple servers. It’s distributed under a GNU GPL 3 and is maintained by EDB.
 
Key capabilities of Barman include:

  • Remote backup and restore with rsync and the PostgreSQL protocol
  • Support for file level incremental backups with rsync
  • Retention policy support
  • Archive compression with gzip, bzip2, or pigz
  • Backup data verification
  • Backup with RPO=0 using a synchronous physical streaming replication connection
  • Rate limiting
  • Tested with custom WAL sizes
     

 

Write ahead logs (WAL) & delayed replica of standby databases

Here’s a typical situation: my coworker accidentally dropped a table. We have a monthly backup policy, and I have the backup that was taken 29 days ago. However, I took the backup of the data folder immediately after my coworker’s mistake, and I also have all the WAL files since the time he dropped it—can you help recover the table?

Let's pause here to take stock and understand what we have at hand to recover. We have two backups (the latest backup and the 29 day-old previous backup), as well all the WAL files since the table was dropped. WAL files are Write Ahead Logs, which means we can move forward from the point when a backup was done. In the above scenario, the latest backup was taken after the occurrence of the issue—so while it’s current, we have to accept that the table is gone, as it’s a backup for that point in time only. The month-old backup might be the best way to recover, albeit at the cost of a month’s worth of data loss.

Another question that is often raised in these conversations: “We have multiple standbys, so do I still need to plan and take backups?”  While it’s good that you can recover from a hardware failure, recovery from human error can be much more expensive, as detailed in the example above.

Circling back to the beginning of the example we started with, my coworker dropped the table, and it replicated to all my standbys instantaneously. Trust me, you will find out that it’s the fastest transaction that you will ever see, no matter whether you are on cloud or on-premises, and any number of standbys that you have are not enough to save you from this disaster (or any other human error) unless you have a delayed standby in your cluster.

Making use of a delayed replica allows you to have an insurance plan of sorts for your quorum by using recovery_min_apply_delay parameter in the postgresql.conf file of the “break glass in case of emergency” standby.  You can set this rather easily:

recovery_min_apply_delay = '1h' 

The value can be configured as hours, minutes or days as it shows in the documentation here.

 

PostgreSQL configuration & WAL 

Understanding what happens behind the scenes

PostgreSQL is a very unique database in its implementation of transaction logs and process architecture. WAL is one of the most important pieces of the PostgreSQL ecosystem. The files are written to pg_xlog / pg_wal directory. Every change made is first written to WAL, hence its name, and the WAL is considered a synonym for transaction logs. There are two minimum settings required: archive_mode & archive_command. In order to ensure consistency, each WAL is written and checked for completion. This makes the base of the recovery process.  Both these parameters can be set in the postgresql.conf. archive_mode which only has three values, off, on or always. There isn’t much difference between “on” and “always.” While it’s on the “on'' value, it only creates WAL archives when the server is up and running. While it’s on “always” value, it continues to generate the archives even when they are being restored. archive_command is the shell command which is used for setting up WAL archiving. An example below from the PostgreSQL documentation:

archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

Here %p in the string is automatically replaced by the path name of the file to archive, and any %f is replaced by only the file name.

Now, let’s understand what goes behind the scenes while we are recovering from a backup. When the PostgreSQL database starts, it starts from a consistent state—the state where the checkpoint has been completed at the time of stopping or taking the backup of the database. (In case you are using a backup, you first need to create a base backup.) Then, it checks for the WAL files, and begins reading them until it finishes all of them. Here is just an idea about how it appears in the log files:

2019-10-26 14:06:16 UTC [16-3] LOG:  starting archive recovery
2019-10-26 14:06:16 UTC [16-4] LOG:  restored log file "000000010000000000000017" from archive
2019-10-26 14:06:16 UTC [16-5] LOG:  redo starts at 0/17000060
2019-10-26 14:06:16 UTC [16-6] LOG:  consistent recovery state reached at 0/17000160
cp: cannot stat '/var/lib/postgresql/backup/000000010000000000000018': No such file or directory
2019-10-26 14:06:16 UTC [16-7] LOG:  redo done at 0/17000160
2019-10-26 14:06:16 UTC [16-8] LOG:  restored log file "000000010000000000000017" from archive
2019-10-26 14:06:16 UTC [16-10] LOG:  archive recovery complete

While trying to recover from a backup, we use the recovery.conf to guide the leader to a consistent state. In versions below 12, the parameters are set in the recovery.conf file however moving forward, the parameters would be available in PostgreSQL.conf. The server uses recovery_target_time to understand what point it needs to recover and therefore, when it is set to latest as explained above, it reads all the WAL files.

 

Managing PostgreSQL backup & recovery

It is very easy to say that one should keep all types of backups to be able to recover from all the scenarios above, but we do understand that managing backups and keeping them in all formats is not only time consuming but also requires hardware. Imagine me asking you to take a pg_dump of a 1 TB database. I am sure the reaction would be “Yeah, right!” Here are a few recommendations based on the size of the database (or a cluster) so as to plan the best possible recovery for most scenarios:

  1. DB Size less than 100 Gb: Since this would require much less space and time, the recommendation here is to make both kinds of backup (physical as well as logical). Doing so will give you some independence in order to recover quickly no matter what the situation is. 
  2. DB Size between 100 Gb and 500 Gb: It would be best to make both kinds of backups, but if you do have to choose between the two, the recommendation is to use physical backups. It gives you some flexibility (with some time and effort) to restore the backup and extract any tables or database to recover from the failures at that level.
  3. DB size greater than 500 Gb: This is where using pg_dump becomes very challenging, and a physical backup is most likely the best path to go on. As suggested above, it can help recover individual objects and databases as well by doing some due diligence. Proper planning and testing is required in handling large databases. The restoration is time consuming so having a fair idea about how much time it is going to take in advance helps in planning contingency measures.

 

Frequency of PostgreSQL database backups

Time to list down your ifs and buts to build the best strategy

The last, but not the least, consideration in this whole picture is backup frequency. This is where generalized suggestions reach their end. In order to even comment on this, we need to understand the mechanics of the environment. That means we need to look at 

  • The frequency of the transactions, 
  • The rate of change in the database size 
  • The criticality of the application its risk tolerance when it comes to availability versus data loss 

You need to carefully study these factors, and then do some test runs with the backups; this helps you verify the recoverability and understand the time taken to recover. Remember, even if you have a monthly backup frequency and all the WAL files in between, the time it takes to read all of these WAL files and replay the transactions may take longer than you can tolerate to get back up and running.

Circling back to where we started—take suggestions from the experts, but make sure you test those suggestions in your environment and tailor the plan to fit the strategy and outcomes you need, in order to be ready with the right backup and methods when the need arises.
 

Managing backups with barman

It’s important to be cautious when managing backups, so you may not want to depend on cron jobs; Barman is a great tool to use. It’s a simple tool with just one command which can take care of backups with multiple switches. The key focus is to ensure that your data is safe, so the recommended architecture itself states that you should have a dedicated Barman server which doesn’t share the storage with the PostgreSQL server. Another additional advantage is that you can manage multiple database servers and multiple database versions with a single instance of Barman. It’s an essential tool for your environment as it provides geo-redundancy by geographically distributing backups and cloud options.

Installation and configuration of BARMAN is pretty straightforward. One can run the following command to get the latest version of BARMAN:

yum instal barman

After inputting this command, you will need to set up the BARMAN configuration.

The location and a sample of the same is:

Location: /etc/barman.conf

Sample BARMAN server configuration:

[barman]
barman_user = barman
configuration_files_directory = /etc/barman.d
barman_home = /var/lib/barman
log_file = /var/log/barman/barman.log
log_level = INFO
compression = gzip

After the server configuration, you will  need to add the servers which would be backed up. One of the prerequisites is to ensure that you have a super user configured at the database end or have a user with replication privileges. 

A sample configuration for rsync based backup would be:

ssh_command = ssh root@pg
conninfo = host=pg user=barman dbname=postgres
backup_method = rsync
parallel_jobs = 1
reuse_backup = link
archiver = on

It also supports streaming replication which can be configured by referring to the documentation. With BARMAN 2.0 onwards it offers the advantage of the streaming replication capabilities of pg_bseback itself. 
Once the configuration and access are done, one can execute the following commands:

To check the configuration:

$barman check pg

And in order to take a backup:

$barman backup <servername>

In order to automate the whole procedure, one can use cron jobs to automate the backups with BARMAN as well. 

Note to everyone: There is no tool to verify that the backup of PostgreSQL which you have taken is valid until you are able to restore it and read the data. Be sure that your backup strategy includes a schedule for making backups as well as a window to restore and test the backups. As a side note, there is a tool called pg_verify_backup that is helpful for detecting certain issues, but it cannot ensure that the backup is fully compliant from an operational perspective.

Read morePostgreSQL Backup & Recovery: The Ifs, Ands & Buts

Share this