Ensure Data Redundancy and Fault Tolerance for Business Continuity
Learn key strategies for high data availability and disaster recovery in PostgreSQL
Find out why these concepts are critical for business continuity
Data redundancy and fault tolerance are critical mechanisms within PostgreSQL to ensure data integrity, high availability, and overall robust operations.
Data redundancy in PostgreSQL entails creating duplicate copies of data through replication methods. Changes made to the primary database are propagated to secondary replicas. Redundancy is crucial as it prevents data loss and allows users to still access accurate, up-to-date information despite hardware failures.
Fault tolerance in PostgreSQL involves techniques that bolster the system’s resilience during malfunctions or network outages. PostgreSQL’s main fault tolerance mechanism is Write-Ahead Logging (WAL): each database transaction is logged before it’s applied to the database. When a crash occurs, the database can be restored to the last known consistent state by replaying the logged transactions.
Through data redundancy and fault tolerance, PostgreSQL systems are well-equipped to handle unscheduled database downtime. The succeeding sections will dive deeper into how these mechanisms work and are implemented. Understanding them is key as poorly managed downtime can lead to hefty financial losses, long productivity disruptions, and regulatory penalties, not to mention damage to your company’s reputation.
Learn the basics of data replication and data availability in PostgreSQL
Data redundancy and fault tolerance in PostgreSQL rely on two key concepts: data replication and data availability.
Data Replication
PostgreSQL offers two primary types of replication: streaming and logical.
- Streaming replication
In streaming replication, every transaction in PostgreSQL is recorded in a transaction log known as the Write-Ahead Log (WAL). The WAL segments are then continuously replicated from the primary server to standby servers by the WAL sender process on the primary and the WAL receiver process on the standby servers. This process is initiated through specific configurations in the postgresql.conf file and the pg_hba.conf file on both primary and standby servers.
Streaming replication can be set up in various configurations, notably a 1:N (one primary to many standbys) setup or a cascading setup where a standby server further replicates data to another standby. Administrators can choose between synchronous and asynchronous replication modes. Synchronous replication ensures that transactions are only considered committed once they have been confirmed on the standby, enhancing data durability. Conversely, asynchronous replication allows the primary to proceed without waiting for standby acknowledgment, which may improve performance at the risk of data loss if the primary server fails. (A configuration sketch follows this list.)
- Logical replication
Logical replication follows a publish-subscribe model. In this setup, a primary instance (the publisher) creates a publication that defines the data set being replicated. Subscribers then create subscriptions to receive changes from this publication. Changes are delivered in real time, ensuring that the subscriber database remains up to date with the publisher.
Logical replication differs from physical replication by facilitating replication based on SQL commands instead of copying entire database blocks. A publication can encompass entire tables or specific subsets of data, while a subscription manages the connection to the publisher and controls which changes to receive. For instance, a table can be published for INSERT, UPDATE, DELETE, and TRUNCATE operations, allowing for selective control over the type of data changes being replicated.
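As a minimal sketch of both mechanisms (the standby name, database, table, and connection details below are hypothetical, and logical replication additionally requires wal_level = logical on the publisher):

# Streaming replication mode: asynchronous is the default. To require
# commit confirmation from a standby registered as 'standby1' (synchronous):
psql -c "ALTER SYSTEM SET synchronous_standby_names = 'standby1';"
psql -c "SELECT pg_reload_conf();"  # this setting is reloadable; no restart needed

# Logical replication: on the publisher, publish one table for selected operations
psql -d appdb -c "CREATE PUBLICATION orders_pub FOR TABLE orders WITH (publish = 'insert, update, delete, truncate');"

# On the subscriber, the target table must already exist with a matching schema
psql -d appdb -c "CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.example.com dbname=appdb user=replicator'
  PUBLICATION orders_pub;"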
Data Availability
Data availability refers to a system’s ability to operate continuously by eliminating single points of failure. PostgreSQL offers several options to achieve high data availability, including:
- Single database instance
A single instance manages all reads and writes and is complemented by standby servers for improved availability.
- Single primary instance with read replicas
A primary instance handles all writes, while one or more replicas serve read traffic, providing distribution and load balancing.
- Multi-master deployments
Data is sharded across multiple standalone instances, enhancing both availability and scalability.
Using tools like Pgpool for load balancing and failover can also enhance the management of high availability setups in PostgreSQL.
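As a rough sketch of where Pgpool-II fits (the hostnames are hypothetical, and parameter names should be verified against your Pgpool-II version), a two-backend load-balancing configuration might look like this:

# Append minimal backend and load-balancing settings to pgpool.conf
cat >> /etc/pgpool-II/pgpool.conf <<'EOF'
backend_hostname0 = 'primary.example.com'
backend_port0 = 5432
backend_weight0 = 1
backend_hostname1 = 'standby.example.com'
backend_port1 = 5432
backend_weight1 = 1
load_balance_mode = on
EOF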
Learn the basics of streaming replication and logical replication
If you have yet to implement data replication in your systems, it’s best to do so promptly to prevent costly database downtime.
Below are the basic steps for configuring streaming replication in asynchronous mode in PostgreSQL. For more detailed steps, see this blog.
(Streaming replication requires the operating system and PostgreSQL versions to be the same across both primary and standby servers.)
Steps to be performed on the primary database instance
- Review the parameter settings.
- Create the /opt/pg_archives folder and set its ownership to the PostgreSQL superuser.
- Set up authentication on the primary server to allow replication connections from the standby server(s).
- Reload/restart the PostgreSQL database instance. (A configuration sketch for these steps follows below.)
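A minimal sketch of these primary-side steps (the replication user, standby address, and version-specific paths are hypothetical):

# 1. Review/set the replication and archiving parameters
psql -c "ALTER SYSTEM SET wal_level = replica;"
psql -c "ALTER SYSTEM SET max_wal_senders = 10;"
psql -c "ALTER SYSTEM SET archive_mode = on;"
psql -c "ALTER SYSTEM SET archive_command = 'cp %p /opt/pg_archives/%f';"

# 2. Create the archive folder, owned by the PostgreSQL superuser
mkdir -p /opt/pg_archives && chown postgres:postgres /opt/pg_archives

# 3. Allow replication connections from the standby (pg_hba.conf)
echo "host replication replicator 192.0.2.10/32 scram-sha-256" >> /var/lib/pgsql/14/data/pg_hba.conf

# 4. Restart so wal_level and archive_mode take effect
systemctl restart postgresql-14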
Steps to be performed on the standby database instance
Follow either of the options below to prepare the environment for setting up a standby database instance.
Option 1
- Create a new data directory and set up the ownership and permissions.
- Change the data_directory location in the /lib/systemd/system/postgresql-14.service file: Environment=PGDATA=/var/lib/pgsql/14/data/
- Execute the command below to implement the changes.
[root@pg ~]# systemctl daemon-reload
Option 2
- Initialize the database instance and remove the entire contents of the data directory.
- After creating the requisite data directory, take a backup using the pg_basebackup command (to be executed on the standby server).
- For PostgreSQL versions 12 and above, create a blank standby.signal file in the data directory.
- Start the PostgreSQL database instance.
- Verify the status of the primary database instance.
- Verify the status of the standby database. (A sketch of these standby steps follows below.)
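A sketch of the Option 2 flow (the host, user, and paths are hypothetical); on PostgreSQL 12 and above, pg_basebackup’s -R flag writes primary_conninfo and creates standby.signal automatically:

# Take a base backup from the primary into the empty standby data directory
pg_basebackup -h primary.example.com -U replicator -D /var/lib/pgsql/14/data -X stream -R -P
systemctl start postgresql-14

# Verify: the primary should show a WAL sender for the standby...
psql -h primary.example.com -c "SELECT client_addr, state FROM pg_stat_replication;"
# ...and the standby should report that it is in recovery
psql -c "SELECT pg_is_in_recovery();"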
However, streaming replication has its limitations. It does not support replication across differing PostgreSQL versions. It also requires strict user authentication setups, or else it risks security breaches. In asynchronous mode, data can be lost if the primary server fails before changes are copied to the standby servers. In synchronous mode, the network must be reliable, as any network issue can stall commits on the primary.
Logical replication, meanwhile, is better suited for more granular and flexible data handling. It operates on a replication identity, typically the primary key of the data objects being replicated. It can be configured to target specific data subsets, work across differing PostgreSQL versions or databases, and support data transformation during replication. Since engineers can filter out and replicate specific tables or individual data elements, it optimizes bandwidth and storage.
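For instance, a publication can be narrowed to a subset of rows (the table and filter below are hypothetical; row filters require PostgreSQL 15 or later):

# Replicate only European orders to downstream subscribers
psql -d appdb -c "CREATE PUBLICATION eu_orders_pub FOR TABLE orders WHERE (region = 'eu');"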
Logical replication also allows for multi-master replication. PostgreSQL does not natively support multi-master configurations, but pairing logical replication with tools like pglogical can create the needed environment.
Learn how to create a solid disaster recovery plan
As reliable as we believe our database systems to be, things can and will go wrong. Server downtime and hardware failures will happen the older your equipment gets, and changes in the market and technology will lead to greater and more complex data demands over time. It’s therefore essential to be proactive rather than reactive and do regular backups for disaster recovery.
Several tools can be integrated with PostgreSQL for backups, such as pg_dump and pg_basebackup. pg_dump can create backups of individual PostgreSQL databases, as well as back up specific tables or schemas to optimize storage and save time. Meanwhile, pg_basebackup creates a binary backup of the whole database cluster, encompassing all database files, configuration files, and transaction logs.
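For instance (the database, table, schema, and paths below are hypothetical):

# Logical backup of one database, in pg_dump's compact custom format
pg_dump -d appdb -F c -f /backups/appdb.dump
# Back up only a specific table or schema
pg_dump -d appdb -t public.orders -f /backups/orders.sql
pg_dump -d appdb -n sales -f /backups/sales_schema.sql
# Binary backup of the entire cluster, streaming the WAL needed for consistency
pg_basebackup -D /backups/base -X stream -P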
Continuous archiving is also a key backup strategy, combining file-system-level backups with Write-Ahead Log (WAL) archiving. It enables Point-in-Time Recovery (PITR), restoring a database to any specific point in time since the last base backup. Continuous archiving also supports a highly available warm standby system that is useful in case of primary database failure.
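A minimal PITR sketch for PostgreSQL 12 and above (the target timestamp and paths are hypothetical; the archive path matches the /opt/pg_archives folder used earlier):

# After restoring the latest base backup into an empty data directory,
# point recovery at the WAL archive and choose a target time
cat >> /var/lib/pgsql/14/data/postgresql.conf <<'EOF'
restore_command = 'cp /opt/pg_archives/%f %p'
recovery_target_time = '2024-06-01 12:00:00'
EOF
# recovery.signal tells PostgreSQL to enter targeted recovery on startup
touch /var/lib/pgsql/14/data/recovery.signal
systemctl start postgresql-14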
These backup strategies should be part of a greater disaster recovery plan. Here are the key steps to creating one:
- Define the objectives and scope. Identify critical database systems, disaster types such as natural disasters or cyber attacks, and goals such as minimizing downtime and protecting data integrity.
- Assess risks. Identify potential threats to the database that could lead to data loss and outages, such as hardware failure.
- Set a clear Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is how long the organization can afford not to have database access, while RPO is the maximum acceptable amount of data loss in terms of time.
- Set a backup strategy. The strategy should include daily or weekly full backups and at least hourly incremental backups. These backups should be stored offsite to protect against local disasters.
- Determine recovery procedures. Assess the severity of the disruption and check the state of database backups. Then, restore the most recent backup to a designated recovery server, check data integrity before bringing the database back online, and update stakeholders on your progress.
- Assign roles to team members. Ensure there is a designated disaster recovery team of IT experts, operators, and managers, each with clear roles and responsibilities.
- Practice testing and validation. Test this plan with the disaster recovery team regularly using different types of disaster scenarios.
- Document everything. Ensure that the plan is written down and easily accessible, that incidents are thoroughly documented, and that stakeholders are properly informed during and after incidents.
- Make improvements. Based on learnings from previous incidents and tests, and changes in technologies and business environments, make improvements to the plan for best results.
Know what RPO and RTO are and how best to achieve them
A clear RPO and RTO are essential when designing a solid disaster recovery plan, as they are measurable requirements to help guide backup strategies.
RPO stands for Recovery Point Objective, an organization’s maximum acceptable amount of data loss in terms of time. It indicates how far back in time data must be recoverable to resume normal operations without issue. For instance, if an organization has an RPO of 30 minutes, backups should be performed every 30 minutes to minimize potential data loss.
Meanwhile, RTO stands for Recovery Time Objective, the target duration for restoring a system after failure. More specifically, it measures the maximum acceptable time a database can be offline before operations are negatively affected. For instance, if an organization has an RTO of two hours, the time between the event that caused downtime and operations returning to normal should be no more than two hours.
PostgreSQL can help achieve your desired RPO and RTO through high availability architecture, ensuring access to data even during outages. It supports streaming replication, continuously copying data from the primary server to standby servers for minimal data loss; offers Point-in-Time Recovery (PITR), allowing the database to be restored to an exact moment in time; and enables automated backup schedules aligned with your RPO.
Learn real-world applications of PostgreSQL’s data redundancy and fault tolerance features
Many businesses today rely on PostgreSQL for stable database operations. Here are a few examples:
Zucchetti
Processing up to 1 million electronic invoices daily, Italian IT company Zucchetti couldn’t afford to lose any data from an outage, as that would cause fiscal problems for its customers. By integrating EDB’s Backup and Recovery Manager (Barman) with its PostgreSQL system, the company gained an enhanced ability to control and confirm the security of its databases. It could verify security and restoration across multiple systems and even simulate the restore process to ensure its systems were always in order.
Regions Financial Corporation
Regions Financial Corporation is one of the nation’s largest full-service consumer and commercial banking, wealth management, and mortgage solutions providers. Aiming to modernize its database platform, it chose EDB Postgres partly for its high availability since banks need to keep their mobile and web platforms running smoothly. For instance, fluctuations in interest rates could drive sudden surges in banking traffic, so the company needed a system that could remain stable or quickly bounce back in case of issues.
Murex
Global software provider Murex offers asset management solutions for complex financial instruments and high trade volumes. Downtime isn’t an option for their 60,000 clients, who rely on their specialized platform for trading, hedging, funding, risk management, and processing operations. Murex collaborated with EDB to integrate Postgres into their MX.3 platform in their move to open source and the cloud. This change provided their clients with high availability for trading, essential for capitalizing on short-lived opportunities, managing risk, and meeting compliance standards for uptime.
Linxup
Linxup develops vehicle and asset-tracking solutions for fleets and service companies in various industries. Tracking more than 200,000 vehicles, the company was processing 11 TBs of data, and its systems were bursting at the seams. It was initially running on an older version of PostgreSQL, and an upgrade was needed, but the company was worried it would cause days of downtime. However, by harnessing EDB Postgres Distributed, Linxup upgraded to the latest Postgres version with zero downtime and now enjoys a more robust and seamless disaster recovery setup.
A well-configured database management system ensures business continuity
Reliability, despite glitches or threats, is a top requirement for any database management system. It’s especially valuable now that many enterprises constantly manage large amounts of critical data. Ensuring data redundancy and fault tolerance in your system is therefore a must.
PostgreSQL provides several features and strategies for high data availability. It can execute streaming and logical replication for minimal data loss, integrate with tools like pg_dump and pg_basebackup for robust backup strategies, offer Point-in-Time Recovery (PITR) to achieve optimal RPO and RTO, and more.
Many companies are assured of business continuity by harnessing these strengths. Your company can too, with the help of EDB’s Postgres experts.
EDB’s experts dive deeper into PostgreSQL and data availability
Get a step-by-step guide to building robust PostgreSQL clusters for optimal replication
Learn about synchronous replication’s costs, performance, and database restore challenges
Discover how government organizations worldwide employ Postgres’ high availability features
High availability is a system characteristic that establishes an agreed level of operational performance, usually uptime, for a higher-than-normal period.
High availability databases rely on redundancy, which involves having a backup server ready to take over and perform a database restore seamlessly. Downtime becomes a quick inconvenience.
Despite advancements in hardware, network, and database technology, many organizations still risk serious database failures. Sixty percent of data operations have experienced an outage in the past three years, and 60% of those outages caused productivity disruptions lasting four to 48 hours. The cost is significant: 70% of outages result in total losses ranging from $100,000 to over $1 million.
Businesses must consider high availability databases and architecture that ensure maximum reliability and continuity.
This is usually done by defining and committing to a certain uptime in your service level agreement (SLA). The “three 9s, four 9s, or five 9s” availability percentages correspond to the maximum time a system may be unavailable: 99.9% availability allows roughly 8.8 hours of downtime per year, 99.99% about 53 minutes, and 99.999% about 5 minutes.
Best practices include synchronous replication for critical data, connection pooling, failover strategies, and regular backups.
Streaming replication utilizes Write-Ahead Logs (WAL) for real-time data transfer; the logs are sent from the primary to the standby server as transactions occur. This ensures data consistency across servers in real-time.
Logical replication employs a publish-and-subscribe model; the publisher node sends changes to the subscriber nodes. Data changes are replicated based on a specified replication identity, typically a primary key. This is better for more selective data distribution.
Effective disaster recovery strategies for PostgreSQL include using robust backup tools like pg_dump and WAL, setting clear Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO), and employing offsite storage solutions to protect backups from local disasters.
- Repmgr
One of the more “traditional” failover systems, Repmgr was originally built to make creating PostgreSQL replicas easier. It’s written in C and uses a custom Raft-like consensus, which means it needs at least three nodes to operate.
- Patroni
Patroni is the first “modernized” failover system. Written in Python, it doesn’t handle consensus itself: it defers consensus handling to an external layer like etcd and employs a leadership lease that may only be held by one node at a time.
- Pg_auto_failover
Rather than relying on consensus, the pg_auto_failover high availability tool employs a sophisticated state machine where a single monitor process makes decisions for the entire cluster, making it an outlier.
Automating backups in PostgreSQL can be achieved using tools like pg_dump for logical backups and Barman or WAL-E for continuous Write-Ahead Log (WAL) archiving. These tools help streamline the backup process, ensuring regular data preservation without manual intervention.
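For example, a nightly logical backup can be scheduled with cron (the path and database name are hypothetical; note that % must be escaped in crontab entries):

# /etc/cron.d/pg-backup: nightly custom-format dump at 02:00, run as postgres
0 2 * * * postgres pg_dump -d appdb -F c -f /backups/appdb-$(date +\%F).dump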
PostgreSQL’s Native Logical Replication (PNLR) has a few fundamental limitations that can affect high availability systems. Examples include, but are not limited to:
- Data Definition Language (DDL) operations are not replicated
- There is no built-in failover capability
- Logical PostgreSQL replication systems require that each row in a replicated table have a primary key
- PNLR is not integrated with backup and recovery solutions
- PNLR does not come with best practices and proven architectures for achieving common tasks
- PNLR only replicates in one direction
EDB PGD architecture promotes high availability for your database clusters through various techniques:
- Automatic failover/switchover
- Uninterrupted online database maintenance
- Patching the system/Postgres with no impact
- In-place major version upgrade
- Increasing resources on the system
EDB Postgres Distributed is the first to deliver Active-Active architecture.
Active-Active architecture, or Geo-Distributed Active Architecture, is a data resiliency architecture that distributes database information across geographically dispersed nodes and clusters. It is a network of separate processing nodes, each with access to a common replicated database. All nodes can participate in a typical application, which means local low latency, with each region capable of running in isolation.
And avoid costly database downtime
Experience high availability and business continuity with EDB Postgres AI. Get database management solutions fit for your needs and access to top-tier Postgres experts whenever you need them.