Creating High Availability PostgreSQL Clusters to Boost Resiliency

How to build robust PostgreSQL clusters to optimize replication and ensure high availability with Raft-based client connection routing

The February 2023 release of EDB Postgres Distributed 5.0 ushered in multiple enhancements to EDB's premier write-anywhere Postgres platform. One of these was an overhaul of how PGD handles client connection routing. It's a significant departure from the past design, so we wanted to discuss why we made these changes and the improvements they bring.

When we first introduced EDB Postgres Distributed (PGD), it went by BDR, short for Bi-Directional Replication. Since writes can occur on any node, there's an inherent risk of write conflicts if two nodes simultaneously modify the same row. PGD provides several mechanisms to address this, such as custom conflict handling, column-level conflict resolution, Conflict-free Replicated Data Types (CRDTs), and eager resolution. Yet the best way to handle conflicts is to avoid them in the first place.

The easiest way to do this is to prevent writes from occurring on multiple PGD nodes at once. We introduced High Availability Routing for Postgres, or HARP, a Patroni-inspired design based on a Distributed Configuration Store (DCS). The idea was simple: each PGD node has a HARP manager service that conveys node status to the DCS, and any HARP router contacts the DCS to obtain the canonical lead master of the database cluster. Clients then connect to the HARP router.

The architecture of this was a bit more complicated:

The illustration above is a simplified view of how this worked, but we color-coded everything to explain what’s happening.

  • PGD nodes and the traffic between them are in orange.
  • The DCS (etcd) and its internal traffic are in purple.
  • The HARP manager and its communication with PGD are in blue.
  • The HARP router and traffic to PGD nodes are in green.
  • Traffic to and from the DCS is in red.

The HARP manager service continuously updates the status of each node in the DCS, and the HARP router obtains the status of the appropriate lead node from the DCS. We used three nodes in this example because it’s the minimum to maintain a voting quorum.

It’s also convenient for service distribution since we can place one of each necessary item on each node. It’s common to split these components up into designated roles. For example, the DCS may be assigned to dedicated nodes so the service can be shared among several database clusters. HARP router services can likewise be decoupled from their respective nodes, allowing for better conceptual abstraction or integration into container-driven environments.

The best part about this design is that it works exceptionally well. Since we control both router and manager components, we can ensure proper failover based on PostgreSQL replication lag, the state of the Commit At Most Once (CAMO) queue, or any number of requirements to prevent conflicts. The primary concerns about this architecture lie in its moderate complexity and dependence on many moving parts, some of which are external to EDB.

How can we address the HARP system's complexity and its reliance on many components? Integration! We discussed our new approach to high availability PostgreSQL routing in a blog, but how did we get here?

Raft is one of the best-known consensus algorithms used across various infrastructure stacks. It's the system used by etcd, which is in turn integrated into Kubernetes, and it forms the backbone of popular HashiCorp tools like Consul, Nomad, and Vault. It should be no surprise that Raft also serves as the consensus model for PGD.

Early iterations of HARP depended on etcd because PGD did not expose its Raft implementation to API calls. Eventually, we added Raft API calls to PGD so HARP could use PGD as its DCS instead of etcd, which removed one point of contention. PGD 5.0 also fully integrated the duties formerly held by the HARP manager. Why have an external process obtain node status and separately update the DCS, when the node and DCS are the same?

Thus, the new design looks more like this:

Compared to the HARP approach, we purged the purple DCS traffic entirely, and the red DCS communication is more direct. The PGD proxy subscribes to PGD status changes through a standard Postgres LISTEN/NOTIFY channel, and PGD updates proxy-related keys as part of its standard operating protocols. Using LISTEN and NOTIFY is important as it prevents excessive polling, which would add more traffic to the Raft layer and potentially disrupt PGD node management.
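
To illustrate the pattern, here is what a LISTEN/NOTIFY exchange looks like in plain SQL. The channel name and payload below are placeholders; PGD's actual channel names and payload format are internal details not shown here.

    -- A subscriber such as a proxy registers interest in a channel once:
    LISTEN routing_change;

    -- When the routing state changes (for example, a new lead node is chosen),
    -- the database pushes a message to every listener on that channel:
    NOTIFY routing_change, 'lead node changed';

    -- Listeners receive the payload asynchronously, so no repeated polling
    -- of a status table is required.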

The final improvement related to connection routing derives from Raft itself. Even after PGD made its API available to HARP, users often chose to keep etcd as the consensus layer. Why?

Our most commonly deployed architecture designates two PGD nodes in a lead/shadow arrangement per location. Whether that location is an availability zone, data center, region, or something else doesn't matter. What's important is that each location can have its own etcd cluster, meaning the HARP router will still function even if half of the PGD cluster is unavailable due to a network partition. If PGD is the DCS, losing half of the cluster means the other half will not be routable, as the DCS will refuse to answer queries without a quorum.

That is no longer the case with EDB PGD 5.0, as we’ve added the capability to create Raft sub-groups. We can now construct a database cluster like this:

This isn’t the best cluster design since it uses an even number of nodes, but it illustrates the point. In this scenario, the PGD proxy in AZ1 would continue to route traffic to the lead node there even if AZ2 vanishes. The Raft sub-group ensures local consensus is possible even when global consensus is unknown. Being independent of global consensus also removes remote consensus latency, allowing faster failover within the sub-group (lower RTO).
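
To sketch how such a topology is declared, PGD models each location as a node sub-group nested under the top-level group, and each sub-group maintains its own local Raft consensus. The snippet below is only a rough illustration: it assumes the bdr.create_node_group() function with a parent_group_name parameter, all group names are placeholders, and the exact options for enabling sub-group routing vary by PGD version, so consult the PGD documentation before applying anything like this.

    -- Illustrative sketch only: create a top-level group, then one
    -- sub-group per availability zone (all names are placeholders).
    SELECT bdr.create_node_group('pgd_world');
    SELECT bdr.create_node_group('az1', parent_group_name := 'pgd_world');
    SELECT bdr.create_node_group('az2', parent_group_name := 'pgd_world');

    -- Nodes joined to 'az1' share a local Raft sub-group, so a lead node
    -- can still be elected in AZ1 even if AZ2 becomes unreachable.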

It may not seem groundbreaking, but combined with DCS change notifications it allows us to abandon our reliance on etcd for certain functionality. Now, there is nothing etcd can do that PGD itself can’t, all without the extra (and sometimes substantial) configuration and maintenance etcd requires.

HARP started as an innovative concept for Postgres high availability and was adapted to a context suitable for PGD. EDB Postgres Distributed 5.0 takes that further by fully embracing the concept and making the DCS state part of the extension, exactly as the original roadmap prescribed. The goal has always been to transform Postgres into a cluster-aware database engine rather than a database engine that could be clustered.

We now have an adequately decoupled design in EDB PGD 5.0: a database cluster and something to connect to it. It's similar to how Oracle works with its Listener service, with one important enhancement: consensus. Why use some obscure configuration syntax to choose the correct node when the nodes themselves can inform the proxy where traffic should go?

We’ve crossed the Rubicon here, and we’re only getting started.

More on EDB PGD and High Availability PostgreSQL

Check out our resources on EDB’s premier write-anywhere platform featuring high availability database clusters

How does EDB PGD improve operations for real-world businesses? Find out how its high availability PostgreSQL strengthened German telecom provider telegra’s IT infrastructure and let their team focus on growth.


Why is achieving high availability through physical streaming PostgreSQL replication a serious challenge? Learn how EDB Postgres Distributed addresses this issue to empower users.


EDB PGD is an influential platform among different industry leaders. Discover how PGD and its high availability PostgreSQL allowed three large companies to free up vital resources, build customer satisfaction, and cut down on costs.


What are high availability databases?

High availability is a system characteristic that establishes an agreed level of operational performance, usually uptime, for a higher-than-normal period.

High availability databases rely on redundancy: a standby server is ready to take over seamlessly and restore service, so downtime becomes a brief inconvenience rather than an extended outage.

Why are high availability databases a necessity?

Despite advancements in hardware, network, and database technology, many organizations still risk serious database failures. Sixty percent of data operations have experienced an outage in the past three years, and 60% of those outages caused productivity disruptions lasting four to 48 hours. The cost is significant: 70% of outages result in total losses ranging from $100,000 to over $1 million.

It’s crucial that businesses consider high availability databases and architecture that ensure maximum reliability and continuity.

How is high availability measured?

This is usually done by defining and committing to a certain uptime in your service level agreement (SLA). The "three 9s, four 9s, or five 9s" availability percentages correspond to the maximum amount of time a system may be unavailable.

Availability            Downtime per year   Downtime per month   Downtime per week   Downtime per day
99.9% (three nines)     8.76 hours          43.28 mins           10.48 mins          1.26 mins
99.99% (four nines)     52.60 mins          4.38 mins            1.01 mins           8.64 secs
99.999% (five nines)    5.26 mins           26.30 secs           6.05 secs           864 millisecs
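
These figures follow directly from the availability percentage: allowed downtime is (1 - availability) multiplied by the length of the period, assuming a 365-day year. A quick illustrative calculation:

    -- Allowed downtime = (1 - availability) * period length
    SELECT (1 - 0.999)   * 365 * 24      AS three_nines_hours_per_year,  -- ~8.76
           (1 - 0.9999)  * 365 * 24 * 60 AS four_nines_mins_per_year,    -- ~52.6
           (1 - 0.99999) * 365 * 24 * 60 AS five_nines_mins_per_year;    -- ~5.3
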
When should I consider high availability architecture?
  • Determine the level of availability you hope to achieve
  • Understand your operational budget
  • Know the cost to your business if there is downtime in the data persistence tier
  • Understand your RPO (Recovery Point Objective)
  • Know your RTO (Recovery Time Objective)
How does high availability PostgreSQL work?

High availability PostgreSQL databases work in two ways:

  • Streaming replication
The replica connects to the primary and continuously receives a stream of WAL records. Streaming replication lets the replica stay more up-to-date with the primary compared to log-shipping replication.
  • Synchronous streaming replication
Databases can also configure streaming replication as synchronous by designating one or more replicas as synchronous standbys. The primary doesn't confirm a transaction commit until the replica acknowledges that the transaction has been persisted; a configuration sketch follows this list.
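
As a rough sketch of the synchronous case, the primary is told which standbys must confirm each commit via synchronous_standby_names. The standby name 'replica1' below is a placeholder for a real standby's application_name; adjust the values to your own cluster.

    -- On the primary (illustrative values only):
    ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (replica1)';
    ALTER SYSTEM SET synchronous_commit = 'on';
    SELECT pg_reload_conf();

    -- Commits now wait until the standby named 'replica1' confirms the WAL
    -- has been flushed, so acknowledged transactions survive a primary failure.
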
What key technologies power PostgreSQL high availability?
  • Repmgr
One of the more "traditional" failover systems, Repmgr began as a tool for creating PostgreSQL replicas more easily. It's written in C and uses a custom Raft-like consensus, which means it needs at least three nodes to operate.
  • Patroni
Patroni is the first "modernized" failover system. Written in Python, it doesn't implement its own quorum; it defers consensus handling to an external layer such as etcd and employs a leadership lease that only one node may hold at a time.
  • Pg_auto_failover
Rather than relying on consensus, the pg_auto_failover high availability tool employs a sophisticated state machine where a single Monitor process makes decisions for the entire cluster, making it a bit of an outlier.
When is standard PostgreSQL replication not enough to maintain high availability?

PostgreSQL's Native Logical Replication (PNLR) has a few fundamental limitations that can affect high availability systems; a minimal setup illustrating some of these constraints follows the list below. Examples include, but are not limited to:

  • Data Definition Language (DDL) operations are not replicated
  • There is no ability to fail over
  • Logical PostgreSQL replication systems require that each row in a replicated table have a primary key
  • PNLR is not integrated with backup and recovery solutions
  • PNLR does not come with best practices and proven architectures for achieving common tasks
  • PNLR only replicates in one direction
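
For context, here is a minimal native logical replication setup between two databases. The table, publication, and connection details are placeholders; it illustrates the one-directional publisher/subscriber model, the primary key requirement, and the fact that schema changes must be applied on both sides by hand.

    -- On the publisher (source) database:
    CREATE TABLE accounts (id bigint PRIMARY KEY, balance numeric);
    CREATE PUBLICATION accounts_pub FOR TABLE accounts;

    -- On the subscriber (target) database, which needs the same table definition:
    CREATE TABLE accounts (id bigint PRIMARY KEY, balance numeric);
    CREATE SUBSCRIPTION accounts_sub
        CONNECTION 'host=publisher.example.com dbname=appdb user=repuser'
        PUBLICATION accounts_pub;

    -- Data flows only from publisher to subscriber, and DDL such as
    -- ALTER TABLE is not replicated automatically.
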
How does EDB Postgres Distributed guarantee high availability database clusters?

EDB PGD architecture promotes high availability for your database clusters through various techniques:

  • Automatic failover/switchover
  • Uninterrupted online database maintenance
  • Patching the system/Postgres with no impact
  • In-place major version upgrade
  • Increasing resources on the system
What is Active-Active architecture?

EDB Postgres Distributed is the first to deliver Active-Active architecture.

Active-Active architecture, or Geo-Distributed Active Architecture, is a data resiliency architecture that distributes database information across geographically distributed nodes and clusters. It is a network of separate processing nodes, each with access to a common replicated database. All nodes can participate in a typical application, which means low local latency, with each region capable of running in isolation.

What else can EDB PGD do besides provide high availability PostgreSQL?

In addition to providing high availability, EDB Postgres Distributed can also:

  • Distribute workloads geographically
For example, if you have a three-node EDB PGD architecture with nodes spread across the globe, each country's local database can manage that country's workload.
  • Provide data localization security
Advanced logical PostgreSQL replication in EDB PGD also allows you to choose access rights and maintain data sovereignty—protecting your organization and limiting threats.

Achieve up to 99.999% uptime even with data-intensive workloads

Find out how your business can process thousands of transactions per second without a hitch and over-deliver on customer expectations