Validating Postgres Distributed for High Availability and Consistency

September 13, 2024

Learn how to build highly available PostgreSQL clusters and achieve reliable transactions in geo-distributed databases

EDB Postgres Distributed (PGD) provides best-in-class high availability for always-on, enterprise-grade Postgres performance across fully managed or self-managed deployments in any cloud. With EDB’s Postgres technology supporting robust, globally distributed applications that process thousands of transactions per second, customers regard EDB Postgres Distributed as their go-to for delivering up to 99.999% uptime and improving data replication performance by up to 5X.

As part of internal activities leading up to our latest EDB Postgres Distributed release, our Engineering team focused on delivering more robust releases that enhance customer confidence in PGD. Setting correct expectations for customers prompted us to improve our internal testing processes, in an effort influenced by the Jepsen database testing framework.

Validating EDB Postgres Distributed for Safety, Correctness, and Consistency

In the Jepsen framework, a control node runs the Jepsen libraries, the PGD-specific plugins, and the tests. It drives operations against the PGD cluster, as shown below.

This blog is the first in a series of posts as we extend our commitment to validating PGD’s availability and consistency in geo-distributed databases.

In the following subsections, we’ll walk through our internal testing and how it validates EDB Postgres Distributed’s availability and consistency in active/active architectures.

Background and Terminology for Database Transactions

Before describing our internal PGD testing enhancements, let’s provide some background and terminology for database transactions.

Operation: An operation (e.g., a read or a write) is carried out by a client. An operation can take some time to complete, and a modifying operation such as a write takes effect at some point between its invocation and its completion. Multiple operations can be in progress simultaneously.

Process: An operation is performed by a logical process. A process can do only one operation at a time.

History: A history is a set of operation invocations and completions with their timestamps. An example is provided below.

invocation, process 0:  {:process 0, :type :invoke, :f :read, :value nil}

invocation, process 1:  {:process 1, :type :invoke, :f :write, :value 3}

completion, process 1:  {:process 1, :type :info, :f :write, :value 3}

completion, process 0:  {:process 0, :type :ok, :f :read, :value 3}

Testing Linearizability in Database Transactions

We used the Knossos checker to verify that a history generated from a set of concurrent operations is linearizable. It models a register with the operations read, write, and compare-and-swap (CAS). Given a set of concurrent operations, the system is validated as linearizable for that set if there is at least one ordering of the operations, consistent with real-time order, under which the history matches the sequential behavior of the register.
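
For intuition, here is a minimal sketch in Clojure of the sequential CAS-register model that such a checker reasons about (illustrative only, not the actual Knossos API): each completed operation either transitions the register state or is flagged as inconsistent.

;; Minimal sketch of a sequential CAS-register model (illustrative, not the Knossos API).
(defn step
  "Apply one completed operation to the register state `value`.
   Returns the new value, or :inconsistent if the operation could not
   have happened from this state."
  [value {:keys [f] :as op}]
  (case f
    :read  (if (or (nil? (:value op)) (= value (:value op))) value :inconsistent)
    :write (:value op)
    :cas   (let [[old new] (:value op)]
             (if (= value old) new :inconsistent))))

;; A history is linearizable if some ordering of its operations, consistent
;; with real-time order, never yields :inconsistent. For example:
(reduce step 0 [{:f :write, :value 3} {:f :cas, :value [3 4]} {:f :read, :value 4}])
;; => 4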

Linearizable system

A linearizable implementation implies that the following properties are met:

  1. The system should support these operations:
    1. read (key)
    2. write (key, value)
    3. compare_and_swap (key, old_value, new_value)
  2. In the presence of multiple concurrent reads and writes from multiple nodes, a read should never return a stale value. A read is stale if it returns some value V even though at least one successful modification that set the key to a different value V2 completed before the read began (see the example history after this list).
  3. As the system moves from one state to another, the state observed by clients should always progress forward; returning a stale value amounts to going backward.
  4. There should be a total order on all the operations.
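
As a concrete illustration of the stale-read property above, the following hand-written history (in the same notation as earlier) is not linearizable: the write of 4 completes before the read is even invoked, yet the read still returns the older value 3.

{:process 0, :type :invoke, :f :write, :value 4}
{:process 0, :type :ok, :f :write, :value 4}
{:process 1, :type :invoke, :f :read, :value nil}
{:process 1, :type :ok, :f :read, :value 3}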

In a distributed system, an operation is applied to the system at a certain time. It may take some time to complete, and if it is a modifying operation, the modification takes effect sometime between the start time and the completion time. The diagrams that follow show how operations proceed concurrently on a distributed system.

The first diagram below provides an example of non-linearizable behavior. After a write has been applied and completed, a subsequent read returns an older value. A linearizable system cannot go “backward”.

The second diagram below shows the behavior of a linearizable system. There is a total order on the operations because the system moves from a state of A = 2 to A = 4 to A = 8, and clients see the system moving forward. An operation takes some time to take effect, but once it does, every node sees the effect of that operation. Thus, the system behaves as one system.

A look at EDB’s implementation of Raft in PGD

PGD uses Raft as a consensus protocol for certain key operations. While Raft is not in the data path of transactions, it is used for configuration management, electing a write leader for a subgroup, serializing DDL and global locks, and reaching consensus in distributed transaction reconciliation.

One of our first steps was to test the linearizability of PGD’s Raft implementation. We used a key-value API provided by the Raft subsystem to test linearizable behavior. The read, write, and CAS calls are implemented through Raft.
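
The tests drive this API with a Jepsen-style register workload. The sketch below is illustrative only; the real tests use Jepsen’s generator machinery and a client that issues the corresponding PGD calls. It shows the shape of the generated operations.

;; Illustrative register workload: each invoked operation is translated by the
;; test client into a read, write, or compare-and-swap call against the
;; Raft-backed key-value API.
(defn r   [] {:type :invoke, :f :read,  :value nil})
(defn w   [] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [] {:type :invoke, :f :cas,   :value [(rand-int 5) (rand-int 5)]})

The client records each completion as :ok, :fail, or :info (outcome unknown), producing the history that the Knossos checker analyzes.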

Using this API to test linearizability yields timeline charts like the sample exhibits below. A failed operation appears in PINK, and a successful operation appears in BLUE. An operation may also complete with an uncertain outcome, meaning it may or may not have taken effect; the Knossos checker accounts for this.

Example of correct execution

Example of invalid analysis

Linearizable reads and Raft

The tests showed some processes reading stale values, as seen in the above example. This is a bug, or at least a linearizability violation. The reason is that the key-value Raft implementation in PGD reads from the leader. The leader should have the latest value, because it acknowledges a request only after a majority of replicas have acknowledged committing the change and after applying the changed state locally. But if there is a leader change, a new leader is elected. The new leader is guaranteed to have the change committed in its Raft log, but the change may not yet have been applied to its key-value store: the new leader’s apply_index may lag its commit_index, so a read may return a stale value.

PGD code that uses Raft for various use cases, such as configuration management, already waits when the applied value needs to be observed locally. For the tests, we created an explicit SQL call that waits until the apply index catches up with the commit index; with this in place, the test succeeded.
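
In the test client, the read path then looks roughly like the sketch below. The SQL function names and connection details here are placeholders for the internal calls described above, not the public PGD API.

(ns pgd-test.client
  (:require [next.jdbc :as jdbc]))

;; Illustrative connection spec; host, database, and credentials are placeholders.
(def ds
  (jdbc/get-datasource {:dbtype "postgresql" :host "node1"
                        :dbname "pgd" :user "test" :password "test"}))

(defn linearizable-read
  "Read key `k` from the Raft-backed key-value store, first waiting for the
   local apply index to catch up with the commit index. Both SQL function
   names are hypothetical stand-ins for the internal calls described above."
  [k]
  (jdbc/execute-one! ds ["SELECT raft_kv_wait_for_apply()"])               ; hypothetical
  (:value (jdbc/execute-one! ds ["SELECT raft_kv_read(?) AS value" k])))   ; hypothetical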

Lessons Learned

During this process, we learned some valuable lessons.

  1. An API definition needs to be precise and allow callers to infer whether a request failed or its outcome is uncertain. For example, throwing errors with the right details, such as a request failing because a timeout expired, lets the caller infer that the result is uncertain: the request may or may not have succeeded. For tests to check results accurately, requests need to return :ok (success), :fail (failure), or :info (uncertainty). An :info result means the request could have either succeeded or failed; wrongly treating a request as failed when it could have succeeded produces false alarms in the tests (see the sketch after this list).
  2. Some tests that use a register workload with the Knossos checker have very high space and time complexity. With fewer keys, verification can take a very long time and consume a lot of memory, so the test duration must be kept short; but short runs may miss bugs, and with more keys the contention needed to trigger a bug may never occur. Similarly, creating a lot of contention can cause the code to hit deadlocks and slow the progress of the test. We are working to strike a balance among these factors so the code gets tested more thoroughly.
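
As an illustration of the first lesson, a test client can map driver exceptions onto the three completion types. This is a simplified sketch of the general pattern; the exception classes and the classification shown here are assumptions for illustration, not the exact rules our tests use.

(defn complete
  "Run `op-fn` for the invoked operation `op` and return its completion.
   Timeouts and dropped connections are uncertain (:info) because the request
   may or may not have taken effect; other SQL errors are treated as definite
   failures in this sketch."
  [op op-fn]
  (try
    (assoc op :type :ok, :value (op-fn))
    (catch java.sql.SQLTimeoutException _
      (assoc op :type :info, :error :timeout))
    (catch java.sql.SQLTransientConnectionException _
      (assoc op :type :info, :error :connection-lost))
    (catch java.sql.SQLException e
      (assoc op :type :fail, :error (.getMessage e)))))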

Status and Work-in-Progress Activities

As a result of writing and running these tests, we were able to target a few areas of next-stage focus, including:

  1. Helping identify some rare cases when update conflicts, in the presence of write skew, cause data divergence among nodes.
  2. Improving PGD error handling on replicas in the presence of conflicts.
  3. Helping identify gaps in eager and async conflict resolution.

We continue to refine our efforts to validate PGD’s availability and consistency in geo-distributed databases with various configurations, and we’ll share those findings in future blogs, focusing on:

  1. Conflict resolution with CRDTs
  2. Qualifying the write leader with a majority commit scope for strict serializability
  3. Qualifying the write leader with a synchronous commit for linearizability
  4. Adding tests for the correctness of global locks

For more information about the latest EDB Postgres Distributed release, check out our EDB Docs.

What are high availability databases?

High availability is a system characteristic that establishes an agreed level of operational performance, usually uptime, for a higher-than-normal period.

High availability databases rely on redundancy, which involves having a backup server ready to take over and perform a database restore seamlessly. Downtime becomes a quick inconvenience.

Why are high availability databases a necessity?

Despite advancements in hardware, network, and database technology, many organizations still risk serious database failures. Sixty percent of data operations have experienced an outage in the past three years, and 60% of those outages caused productivity disruptions lasting four to 48 hours. The cost is significant: 70% of outages result in total losses ranging from over $100,000 to over $1 million.

Businesses must consider high availability databases and architecture that ensure maximum reliability and continuity.

How is high availability measured?

Availability is usually measured by defining and committing to a certain uptime in your service level agreement (SLA). The “three 9s, four 9s, or five 9s” availability percentage corresponds to the amount of time a system may be unavailable: 99.9% allows roughly 8.8 hours of downtime per year, 99.99% about 53 minutes, and 99.999% about 5 minutes.

What are the best practices for replication in PostgreSQL?

Best practices include synchronous replication for critical data, connection pooling, failover strategies, and regular backups.

What is the difference between streaming and logical replication in PostgreSQL?

Streaming replication uses the write-ahead log (WAL) for real-time data transfer; WAL records are sent from the primary to the standby server as transactions occur. This keeps data consistent across servers in real time.

Logical replication employs a publish-and-subscribe model; the publisher node sends changes to the subscriber nodes. Data changes are replicated based on a specified replication identity, typically a primary key. This is better for more selective data distribution.

What strategies are effective for disaster recovery in PostgreSQL?

Effective disaster recovery strategies for PostgreSQL include using robust backup tools like pg_dump and WAL, setting clear Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), and employing offsite storage solutions to protect backups from local disasters.

What key technologies power PostgreSQL high availability?
  • Repmgr
    One of the more “traditional” failover systems, Repmgr was originally built to make creating PostgreSQL replicas easier. It’s written in C and uses a custom Raft-like consensus, which means it needs at least three nodes to operate.
  • Patroni
    Patroni is the first “modernized” failover system. Written in Python, it doesn’t implement its own quorum; it defers consensus handling to an external layer like etcd and employs a leadership lease that may only be held by one node at a time.
  • Pg_auto_failover
    Rather than relying on consensus, the pg_auto_failover high availability tool employs a sophisticated state machine where a single monitor process makes decisions for the entire cluster, making it an outlier.
How can I automate backups in PostgreSQL?

Automating backups in PostgreSQL can be achieved using tools like pg_dump for logical backups and Barman or WAL-E for continuous Write-Ahead Log (WAL) archiving. These tools help streamline the backup process, ensuring regular data preservation without manual intervention.

When is standard PostgreSQL replication not enough to maintain high availability?

PostgreSQL’s Native Logical Replication (PNLR) has a few fundamental limitations that can affect high availability systems. Examples include, but are not limited to:

  • Data Definition Language (DDL) operations are not replicated
  • There is no built-in ability to fail over
  • Logical PostgreSQL replication requires that each replicated table have a primary key
  • PNLR is not integrated with backup and recovery solutions
  • PNLR does not come with best practices and proven architectures for achieving common tasks
  • PNLR only replicates in one direction
How does EDB Postgres Distributed guarantee high availability database clusters?

EDB PGD architecture promotes high availability for your database clusters through various techniques:

  • Automatic failover/switchover
  • Uninterrupted online database maintenance
  • Patching the system/Postgres with no impact
  • In-place major version upgrade
  • Increasing resources on the system
What is active-active architecture?

EDB Postgres Distributed is the first to deliver active-active architecture.

Active-active architecture, or geo-distributed active architecture, is a data resiliency architecture that allots database information over geographically distributed nodes and clusters. It is a network of separate processing nodes with access to a common replicated database. All nodes can participate in a typical application, which means local low latency with each region capable of running in isolation.

What else can EDB PGD do besides provide high availability PostgreSQL?

In addition to providing high availability, EDB Postgres Distributed can also:

  • Distribute workloads geographically

For example, if you have a three-node EDB PGD architecture with nodes spread across the globe, each country’s local node can handle that country’s workload.

  • Provide data localization security

Advanced logical PostgreSQL replication in EDB PGD also allows you to choose access rights and maintain data sovereignty — protecting your organization and limiting threats.

Get 99.999% Uptime with EDB Postgres Distributed

High availability PostgreSQL is possible with the help of our experts
