Detecting smelly backups with Barman

August 27, 2014

“Ooooh that smell! Can’t you smell that smell?” That’s a classic rock song by legends Lynyrd Skynyrd, I know. But it is also a warning that your new Barman 1.3.3 installation can now emit.

Consider the following scenario:

  • You have scheduled a weekly full backup of your Postgres server with Barman – the usual (and boring) 4AM on a Saturday
  • You have even configured Nagios/Icinga to correctly monitor the state of the barman check command, as well as any other standard metrics of your Linux system
  • You have also set up pre- and post-backup hook scripts that notify you via email when the backup starts and finishes

However:

  • Last Saturday your backup server had to go through a maintenance operation and was switched off by IT operations for a few hours at 3AM
  • Activity resumed at 6AM

It is Monday morning, you go back to work and start reading your email.

You do not (cannot?) realise that two emails are missing from your routine (the ones about start and stop of the backup operation) and you assume everything is fine. A warning email would just be better, wouldn’t it?

You are not aware that the weekly backup has not been taken and that, as a result, your recovery time objective (the time it takes for you to restore a database – in this case – after a disaster) could be compromised.

This is obviously a very simple (and mild) case.

Scenarios could be far more serious and painful, and they generally fall under the “Inadequate Monitoring” category. A recurring case, unfortunately, is this: the backup server is not properly monitored and barman backup has repeatedly failed in the preceding weeks due to insufficient disk space.

This is the reason behind the implementation of the last_backup_maximum_age option, a new feature of Barman 1.3.3, also known as smelly backups.

Let’s go back in time – we can, given that we work with Point In Time Recovery, can’t we? 😉

It is Friday morning, you read that Barman 1.3.3 has come out and you decide to:

  • update Barman to version 1.3.3
  • configure the last_backup_maximum_age parameter as follows:
    last_backup_maximum_age = 1 WEEK

It is Monday morning (again), you get back to work and you notice in your inbox that your Icinga/Nagios server has been complaining for a day that barman check for that backup server did not return SUCCESS.

You can easily remedy the problem by connecting to the Barman server and manually issuing a barman backup command. Once the backup ends, all your alerts vanish. Your last available backup is fresh again – and should smell really good now!

Technically, barman check now calculates the time interval between the end of the last available backup for a server and the current time (now). This interval is the age of the latest backup. If it is greater than the configured last_backup_maximum_age, barman check (including the '--nagios' variant) starts complaining.
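To make the comparison concrete, here is a minimal sketch of that logic in Python. It is not Barman's actual code: the function names are hypothetical, only DAY and WEEK units are handled, and the backup end times in the usage lines are made up to match the scenario above.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helpers for illustration only; not Barman's implementation.
UNIT_DAYS = {"DAY": 1, "WEEK": 7}


def parse_max_age(text):
    """Turn a value such as '1 WEEK' or '10 DAYS' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s+(DAY|WEEK)S?\s*", text.upper())
    if match is None:
        raise ValueError("unsupported interval: %r" % text)
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(days=value * UNIT_DAYS[unit])


def backup_is_fresh(last_backup_end, max_age, now=None):
    """Return True while the latest backup is no older than max_age."""
    now = now or datetime.now()
    # Age of the backup: from the end of the last backup until now
    age = now - last_backup_end
    return age <= max_age


max_age = parse_max_age("1 WEEK")
monday = datetime(2014, 8, 25, 9, 0)

# The Saturday backup was skipped: the latest one is from the week before
print(backup_is_fresh(datetime(2014, 8, 16, 4, 30), max_age, monday))

# After a manual barman backup, the latest backup is recent again
print(backup_is_fresh(datetime(2014, 8, 25, 8, 30), max_age, monday))
```

The first call reports a stale backup and the second a fresh one, which mirrors the alert appearing and then vanishing once a new backup completes.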

If you have carefully looked into the output of barman check, you should have noticed that a new line has been added:

backup maximum age: OK (interval provided: 7 days, latest backup age: 4 days, 8 hours, 8 minutes)

The advice I give everyone is to set this option in Barman as soon as possible, either at a global level or on a per-server level. It won’t hurt, and you will probably never notice it (or end up needing it). But it could really save you under some rare circumstances. It will definitely make your backup solution more robust in terms of monitoring.
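As a rough illustration of the two levels, a configuration file could look like the fragment below. The server name "main", its description and the 3 DAYS value are made up for this example; only the option name and the 1 WEEK value come from this article.

```ini
; Global section: default for every server unless overridden
[barman]
last_backup_maximum_age = 1 WEEK

; Hypothetical server section with a stricter, per-server value
[main]
description = "Main PostgreSQL server"
last_backup_maximum_age = 3 DAYS
```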

A few more ideas in this field that will probably be put in the next releases of Barman are:

  • a check that calculates the space required for storing a new full backup and refuses to start it in case the available space is not enough
  • an option that lets you choose which alerts barman check triggers for a specific server