Can PgBouncer handle failover to a new machine?

December 17, 2019

Part Three: PgBouncer and persistent connections during downtime

(This is part three of my technical response to a series of questions about the use of PgBouncer and what you need to look out for. Part two can be found here.)

So, we have now successfully negotiated a one-minute network timeout and a server restart, all within a single PgBouncer session, and have done so silently. To a user of the application there may, at worst, be an error message; however, the transaction can be completed at the next try in the same session, and under most circumstances there will only be a delay before the transaction reports back to the application as complete. How the application deals with that delay is up to the application, not PgBouncer.

But what about failover? What if an entire machine goes down and another is brought up to replace it? Surely PgBouncer can’t cope with that. It is, after all, a completely different machine.

Well, actually, it can. If you have two or more databases in streaming replication, you can use an EDB tool called EDB Failover Manager™ (EFM) with its Virtual IP (VIP) capability. EFM allows for transparent, automatic failover - or a manual switchover, commonly used during maintenance of databases in a cluster - between the databases in replication, and connections to the VIP are then re-established on the new master.

You can connect PgBouncer to the VIP, rather than the IP address of the box itself, and, following a change of master, PgBouncer will reconnect to the new master and to what looks like the same database, as if there had simply been a network outage.

To prove it, we’ll first need to set up a more complicated architecture.

We need to set up (at least) two more servers. One will be a standby to our current database, with the database we have been using so far acting as the initial master. We will set up streaming replication between these databases and then set up EFM nodes to control them. Then we will use EFM and its VIP capability to fail over between them, and, by connecting PgBouncer to the VIP rather than to the box itself, a failover will look to PgBouncer like a network outage rather than a complete change of machine.

The other server will be the EFM witness node. The witness node's role is to sit in an EFM cluster and maintain quorum between the other node, or nodes, in the cluster. In this way EFM avoids what is called "split brain," where two or more nodes in the cluster believe they are the master database at the same time. It is worth noting that having no master node in a cluster is considered preferable to having two - or more - since each extra master is an extra end point for applications to connect and write to, which makes consolidating the data when rebuilding the new master after a failure orders of magnitude more difficult.

Think of the EFM witness node as a watchdog that makes sure everything goes smoothly and that at least two EFM nodes agree on what action to take. The witness node itself isn't associated with a database, and, because it's relatively lightweight, it's often installed alongside other EDB products outside of the main databases and their own EFM nodes.

To set up two EPAS databases in streaming replication, follow the instructions here:

https://www.enterprisedb.com/edb-docs/d/edb-postgres-failover-manager/user-guides/user-guide/3.6/configuring_streaming_replication.html
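
The exact files and parameters depend on your Postgres/EPAS version - the guide above is the authoritative reference - but as a rough sketch for a pre-v12 server, the key pieces look something like this (the replication role edbrepuser is just a placeholder; the host and port are from my example setup):

# master: postgresql.conf (illustrative values)
wal_level = replica
max_wal_senders = 5

# standby: recovery.conf (pre-v12; newer versions move these settings into
# postgresql.conf alongside a standby.signal file)
standby_mode = 'on'
primary_conninfo = 'host=192.168.15.6 port=5444 user=edbrepuser'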

To set up a master, standby, and witness EFM cluster, follow the instructions here:

https://www.enterprisedb.com/edb-docs/d/edb-postgres-failover-manager/user-guides/user-guide/3.6/tutorial.html

Take care to open ports 7800-7810, and set up your VIP as shown here:

https://www.enterprisedb.com/edb-docs/d/edb-postgres-failover-manager/user-guides/user-guide/3.6/using_vip_addresses.html
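
For reference, this is roughly what that looked like on my nodes. The firewalld commands assume a RHEL/CentOS-style box, and the interface name (eth0), prefix, and file path are specific to my setup, so check them against your own environment and the guide above:

# open the EFM agent ports (run on each node; adjust for your firewall)
firewall-cmd --permanent --add-port=7800-7810/tcp
firewall-cmd --reload

# VIP-related entries in /etc/edb/efm-3.6/efm.properties
virtual.ip=192.168.15.200
virtual.ip.interface=eth0
virtual.ip.prefix=24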

So, having set up our EFM cluster with VIP support, let’s check that everything is ready to go:

# ./efm cluster-status efm
Cluster Status: efm

Agent Type  Address              Agent  DB       VIP
-----------------------------------------------------------------------
Master      192.168.15.6         UP     UP       192.168.15.200*
Standby     192.168.15.8         UP     UP       192.168.15.200
Witness     192.168.15.9         UP     N/A      192.168.15.200

Allowed node host list:
192.168.15.9 192.168.15.6 192.168.15.8

Membership coordinator: 192.168.15.9

Standby priority host list:
192.168.15.8

Promote Status:

DB Type     Address              WAL Received LSN   WAL Replayed LSN   Info
---------------------------------------------------------------------------
Master      192.168.15.6                            0/50022F0
Standby     192.168.15.8         0/50022F0          0/50022F0

Standby database(s) in sync with master. It is safe to promote.

The databases are in sync and ready to fail over, and the VIP is visible across all three nodes in the EFM cluster, although, as the asterisk (*) on the Master row shows, only the master database is actually using the VIP right now.
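
If you want to double-check at the operating system level which box actually holds the VIP, a quick look at the network interfaces will show it bound only on the master. Run this on each node, and only the current master should list 192.168.15.200 (eth0 is an assumption about my interface name):

$ ip addr show eth0 | grep 192.168.15.200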

So, connecting PgBouncer to the VIP:

$ grep pgb /etc/pgbouncer/pgbouncer.ini
pgb = dbname=pgb host=192.168.15.200 port=5444

Restart PgBouncer, and attempt to connect to the master database (currently using the 192.168.15.200 VIP address in my example).
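
How you restart PgBouncer depends on how it was installed; on a systemd-managed box it will be something like this (the service name may differ on your system):

# systemctl restart pgbouncer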

$ psql -U pgb -h 127.0.0.1 -p 6543 pgb
pgb=# select * from test;
 x 
---
 1
 2
 3
 4
(4 rows)

Connecting to the VIP works!
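
For extra reassurance that the backend connection really is going through the VIP rather than the node's own address, you can ask the server which address it accepted the connection on; in this setup it should report 192.168.15.200:

pgb=# select inet_server_addr();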

Now we will switch over to the standby (which will become the new master) using EFM, and the old master will become a standby. We do this with the EFM promote command and its -switchover option, which is useful during maintenance, for example.

# ./efm promote efm -switchover
Promote/switchover command accepted by local agent. Proceeding with promotion and will reconfigure original master. Run the 'cluster-status' command for information about the new cluster state.

Great. This was accepted by EFM, and the roles of the databases have swapped. We can now check that the databases are still in sync.

# ./efm cluster-status efm
Cluster Status: efm

Agent Type  Address              Agent  DB       VIP
-----------------------------------------------------------------------
Standby     192.168.15.6         UP     UP       192.168.15.200
Master      192.168.15.8         UP     UP       192.168.15.200*
Witness     192.168.15.9         UP     N/A      192.168.15.200

Allowed node host list:
192.168.15.9 192.168.15.6 192.168.15.8

Membership coordinator: 192.168.15.6

Standby priority host list:
192.168.15.6

Promote Status:

DB Type     Address              WAL Received LSN   WAL Replayed LSN   Info
---------------------------------------------------------------------------
Master      192.168.15.8                            0/60001A8
Standby     192.168.15.6         0/60001A8          0/60001A8

Standby database(s) in sync with master. It is safe to promote.

And they are. The VIP is now held by the new master database, and the write-ahead logs are in sync. But did PgBouncer survive?

From the same PgBouncer session as before:

pgb=# select * from test;
 x 
---
 1
 2
 3
 4
(4 rows)

So, we have negotiated a controlled switchover using EFM and the VIP capability between two completely different machines, and as far as PgBouncer is concerned, the session is still active as if nothing had happened.
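
If you want further evidence of what happened behind the scenes, PgBouncer's admin console - assuming an admin user (pgbouncer here) is configured in pgbouncer.ini - will show the current server connections and when they were created, so you can see that the connection to the new master was quietly established during the switchover:

$ psql -U pgbouncer -h 127.0.0.1 -p 6543 pgbouncer
pgbouncer=# SHOW SERVERS;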

In the next blog, I will highlight how to simulate a complete database outage and what this means for a PgBouncer session.

 
