Quick and Reliable Failure Detection with EDB Postgres Failover Manager

June 10, 2020

 

This blog is part of a series on best practices for high availability. In previous blogs, we discussed "What Does High Availability Really Mean", "Patching Minor Version in Postgres High Availability (HA) Database Cluster: Plans & Strategies for DBAs", and "Key Parameters and Configuration for Streaming Replication in Postgres 12".

In this blog, we look at the failure detection mechanisms in a highly available EDB Postgres cluster managed by EDB Postgres Failover Manager (EFM), and at the failover timelines.

EFM is a tool that helps users manage a highly available Postgres cluster (one master and its standbys). It detects possible failures, recovers from them, and notifies users/DBAs. An EFM cluster needs at least three nodes so that EFM has a quorum. A three-node cluster can consist of either one master and two standbys, or one master, one standby, and one witness node.
 


EDB Postgres Failover Manager with Multi-node Cluster

Let’s have a look at the three-node cluster architecture of EDB Postgres Failover Manager with one master and two standbys.

In the above diagram, EFM agents I, II, and III each have a solid line to the local database and dashed lines to the remote databases. The two line types are defined below:

  • Solid lines mean continuous monitoring of the local database.
  • Dashed lines mean an on-demand health check of a remote database, i.e., if an EFM agent fails to connect to its local database for some reason, it asks the EFM agents on the other nodes to confirm whether they can reach that database.

 

Automatic Failure Detection for Postgres 

EDB Postgres Failover Manager has the ability to detect the following types of failures in a Highly Available streaming replication cluster.

Database Failure

A database failure is a failure of the database itself, while the EFM agents on the nodes/VMs remain available and keep communicating with each other.

  • Master failure
    • In an EDB Postgres streaming replication cluster managed by EFM, if the primary/master database fails (for example, due to a disk or RAM failure) and EFM agent (I) is not able to connect to the local database, it asks its peers, EFM agents (II) and (III), to confirm whether they can reach the master database. EFM agents (II) and (III) use a remote database connection check to confirm the master’s availability. If their responses confirm that the master is down, EFM agent (I) notifies EFM agents (II) and (III) of the election of a new master and proceeds with failover to the standby that is closest to the master (i.e., the one that has received the most transactions/WAL).
       
  • Standby failure
    • If EFM agent (II), running on standby 1, detects a failure of the standby database, it follows the same procedure as for a master failure, i.e., it asks its peers, EFM agents (I) and (III), to confirm whether they can reach its local database. If the peer agents confirm that the database is unreachable, EFM sends a notification, either via SMTP or by writing to a log file, and stops monitoring the standby node.
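
The confirmation step described above can be sketched in a few lines of Python. This is only an illustration of the decision flow, not EFM's actual implementation: the function name, the `Verdict` type, and the rule that a single peer reaching the database is enough to cancel the failure are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    database_failed: bool
    reason: str


def confirm_database_failure(
    local_check: Callable[[], bool],
    peer_checks: Dict[str, Callable[[], bool]],
) -> Verdict:
    """Decide whether the local database has really failed.

    local_check returns True when the local database answers.
    peer_checks maps a (hypothetical) peer agent name to a callable that
    returns True when that peer can reach the same database remotely.
    """
    if local_check():
        return Verdict(False, "local database is healthy")

    # The local check failed, so ask every peer for a second opinion.
    reachable_from = [name for name, check in peer_checks.items() if check()]
    if reachable_from:
        # Assumption: one successful remote check cancels the failure.
        return Verdict(False, "reachable from: " + ", ".join(sorted(reachable_from)))

    return Verdict(True, "no agent can reach the database")
```

Passing the checks in as callables keeps the sketch independent of any database driver; in a real setup each callable would wrap a connection attempt with its own timeout.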


Node Failure 

A node failure is a failure in which the VM/node itself is unavailable due to a crash.

  • Master node/VM failure
    • If the master node is down, EFM agents (II) and (III) stop receiving any communication from EFM agent (I). EFM agents (II) and (III) then confirm with each other that the master is unreachable. Based on this exchange, they decide to promote the standby that is most up to date (i.e., has received the most transactions/WAL from the master). When promoting the standby, EFM also sends a notification via SMTP or the log, informing the operator that the old master is no longer available.
       
  • Standby node/VM failure
    • If a standby node is down and EFM agents (I) and (III) don’t receive any communication from EFM agent (II), then EFM agent (I) sends a notification using SMTP.
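
To make "most up to date" concrete, here is a minimal sketch of picking the standby that has received the most WAL. Postgres reports WAL positions as LSNs in the form `high/low`, two hexadecimal numbers, so they can be compared after conversion to integers. How EFM itself compares standbys is not covered in this blog; the function names and standby names below are hypothetical.

```python
from typing import Dict


def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/3000060' into a comparable integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_promotion_candidate(received_lsn: Dict[str, str]) -> str:
    """Return the standby name whose received WAL position is furthest ahead.

    received_lsn maps a standby name to the last LSN it received.
    On a tie, the first standby in the mapping wins (an arbitrary choice
    made for this sketch).
    """
    if not received_lsn:
        raise ValueError("no standby available for promotion")
    return max(received_lsn, key=lambda name: lsn_to_int(received_lsn[name]))
```

For example, `pick_promotion_candidate({"standby1": "0/3000060", "standby2": "0/3000148"})` selects `standby2`, because its received position is further along in the WAL.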

 

Split Brain Situation

EDB Failover Manager avoids a split-brain scenario after failover. In the case of a master database failure, EFM promotes a standby to be the new master and, by creating a recovery.conf file in the old master's data directory, makes sure that the old master cannot restart. This avoids any kind of split-brain scenario, especially if the master failure was temporary or a DBA manually restarts the server.

In the case of a master node failure, if the old master node comes back after failover, its EFM agent checks the status of the cluster with the other EFM agents in the cluster. If they confirm that another master is available in the same cluster, the EFM agent on the old master node creates a recovery.conf file to avoid a possible split-brain situation.

Please note that in the case of a node failure, it is important to configure the system so that the EFM service restarts before the EDB Postgres Advanced Server service.
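
The check that a returning agent performs can be pictured with the small sketch below. It only illustrates the rule "if another master already exists, the old master must be fenced"; the function name and the role strings are assumptions for this example, and the fencing itself (creating recovery.conf) is left out.

```python
from typing import Iterable


def must_fence_old_master(peer_roles: Iterable[str]) -> bool:
    """Return True when the old master must be kept from coming back.

    peer_roles holds the role each reachable peer agent reports for its
    own database, for example "master", "standby" or "witness"
    (hypothetical labels used only in this sketch).
    """
    # Another master in the cluster means a failover has already happened.
    return any(role == "master" for role in peer_roles)
```

This is also why the start order matters: the agent has to ask its peers before the local database has a chance to accept writes again.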
 


Failure Detection Parameters

EFM uses timeout parameters to monitor the local database and to confirm whether a database has truly failed. These parameters also determine the timeline of a failover, that is, the total time EFM takes to fail over from the master to a standby (which then becomes the new master). Let’s have a look at these parameters.

 

EFM Database Failure Detection Parameters

  • local.period
    • This parameter defines how frequently the EFM agent checks the health of the local database server. The default is 10 seconds.
       
  • local.timeout
    • This parameter defines how long the EFM agent will wait for a response from the local database server. If this time passes without receiving a response, then the EFM agent proceeds to the final check. The default value is 60 seconds.
       
  • local.timeout.final
    • This parameter determines how long the EFM agent waits after the final attempt to contact the local database server. If it doesn't receive any response from the database within the number of seconds specified by this parameter, EFM declares the local database's failure. The default value is 10 seconds.
       
  • remote.timeout
    • This parameter defines how long the EFM agent waits for a response from a remote database server when verifying that the remote database is down. The default value is 10 seconds.

 

How Do These Failover Timeouts Work?

You might be wondering how all of this works together in the case of a master database failure. Let's take an example from the diagram mentioned above. EFM agent (I) performs a periodic health check of the local database server every 10 seconds (local.period). Suppose the master database goes down for some reason. If the health check fails and the failure response is received quickly, EFM agent (I) makes one final attempt at the local database health check and then reaches out to its peer agents for master/primary failover. If instead the database is unresponsive, for example because of a network interface issue or a delay in creating the connection, EFM agent (I) tries to connect to the local database and waits up to 60 seconds (local.timeout). If it doesn't get a response, it tries to connect to the local database a final time and waits up to 10 seconds (local.timeout.final). If EFM agent (I) gets a failure response or doesn’t receive any response within local.timeout.final, it declares the failure of the local database (master).

After declaring the failure of the master, EFM agent (I) reaches out to its peer agents, EFM agents (II) and (III), for confirmation. EFM agents (II) and (III) try to connect to EFM agent (I)’s local database and send the result back to EFM agent (I). If EFM agents (II) and (III) don't get a response within 10 seconds (remote.timeout), because of a network issue/slowness or because the server is not responding, they time out or close the connection and inform EFM agent (I) that the master is unreachable. After receiving this confirmation, EFM agent (I) sends a signal to EFM agents (II) and (III) asking them to promote one of the standbys to become the new master.
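
As a rough worked example, the timeouts can be added up to estimate how long detection takes in the slowest case. The sketch below assumes that the phases run strictly one after another and that each one runs to its full timeout; it does not include the time needed to promote the standby. The class and function names are made up for this illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DetectionTimeouts:
    """Timeouts in seconds; the defaults are the EFM defaults listed above."""
    local_period: int = 10
    local_timeout: int = 60
    local_timeout_final: int = 10
    remote_timeout: int = 10


def detection_timeline(t: DetectionTimeouts) -> List[Tuple[str, int]]:
    """Return (phase, seconds elapsed at the end of the phase) pairs."""
    phases = [
        ("wait for the next periodic check", t.local_period),
        ("local check times out", t.local_timeout),
        ("final local check times out", t.local_timeout_final),
        ("remote checks by the peers time out", t.remote_timeout),
    ]
    elapsed = 0
    timeline = []
    for name, duration in phases:
        elapsed += duration  # running total across sequential phases
        timeline.append((name, elapsed))
    return timeline
```

With the default values the last entry is reached after 10 + 60 + 10 + 10 = 90 seconds. When the database fails fast, for example with an immediate connection refusal, the real detection time is much shorter than this upper bound.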

 

EFM Node Failure Detection Parameters

  • node.timeout
    • This parameter helps EFM agents determine whether a node in an HA cluster is down. If an agent doesn't communicate with the other agents for the number of seconds specified by this parameter, the other agents assume that the node has failed and that they should take action.
    • If the master node is unreachable, the other agents plan to promote the standby that is most up to date with the master, based on the transactions received. The default value for this parameter is 50 seconds.
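
The idea behind node.timeout can be sketched as a simple "last heard from" check. This is a minimal illustration under the assumption that each agent records a timestamp whenever it hears from a peer; the function name, the node names, and the use of plain float timestamps are choices made for this example, not EFM internals.

```python
from typing import Dict, List


def silent_nodes(
    last_heard: Dict[str, float],
    now: float,
    node_timeout: float = 50.0,
) -> List[str]:
    """Return the nodes whose agents have been silent for node_timeout seconds.

    last_heard maps a node name to the time (in seconds) of the last
    message received from that node's agent. The default of 50 seconds
    is the node.timeout default mentioned above.
    """
    return sorted(
        name for name, seen in last_heard.items() if now - seen >= node_timeout
    )
```

For instance, with `last_heard = {"node1": 100.0, "node2": 140.0}` and `now=150.0`, only `node1` is reported with the default timeout, while lowering `node_timeout` to 10 reports both nodes.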


Now, you may be wondering why these parameters have such high values by default. The answer lies in infrastructure reliability. The defaults are high because EFM should work properly out of the box in an environment where the network/hardware/servers are not 100% reliable. However, if you use an environment or a platform provider like AWS/Azure/GCloud that guarantees the reliability of the platform, you may want to tune these parameters to reduce the failover time.

The following are the parameters’ values we used in the public cloud, based on the reliability of the network and the expected response from the database.

 

Parameter              EFM default value (seconds)    Custom value on AWS/Azure (seconds)
local.period           10                             3
local.timeout          60                             5
local.timeout.final    10                             5
remote.timeout         10                             5
node.timeout           50                             10


With the above-mentioned custom settings, we saw EFM failover times in the range of 20 to 25 seconds in public clouds like AWS or Azure, both within one region across different availability zones and across regions.
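
If you want to carry the custom values listed above into your own configuration, a small helper like the one below can print them as key=value lines. It assumes that the EFM properties file uses the usual `key=value` properties format; the helper and the dictionary name are hypothetical, and the values are the ones we used on AWS/Azure, not a general recommendation.

```python
from typing import Dict

# Custom values (in seconds) from the comparison above.
TUNED_TIMEOUTS: Dict[str, int] = {
    "local.period": 3,
    "local.timeout": 5,
    "local.timeout.final": 5,
    "remote.timeout": 5,
    "node.timeout": 10,
}


def render_properties(settings: Dict[str, int]) -> str:
    """Render settings as key=value lines, one per parameter."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_properties(TUNED_TIMEOUTS))
```

Review the generated lines against the properties file shipped with your EFM version before applying them, since parameter names and defaults can differ between releases.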

With EDB Postgres Failover Manager (EFM), a user can have a highly available Postgres cluster (master/standby/witness or master/standby/standby) with the following capabilities:

  1. Automatic failover in case of database/node failure. 
  2. Handling of the split-brain situation after failover. 
  3. Failover within seconds (20 to 25 seconds with the tuned settings above).
  4. Tunable failover time, based on the reliability of the user's network/platform.

 
