Mastering Postgres High Availability: Key Metrics and Best Practices

November 12, 2024

Everyone talks about high availability for databases, but what does this really mean? How does disaster recovery relate to high availability? And what are the most effective ways to enhance database availability?

To address these questions and more, we brought together Gianni Ciolli, EDB’s Global VP Practice Lead, and Petr Jelinek, EDB’s VP Chief Architect for Postgres Distributed, for a webinar discussion moderated by EDB’s Director of Product Marketing, Angel Camacho. With their collective insights and tips, you can help ensure your data remains highly available, whether downtime is planned or unplanned.

Here are a few of the key insights from the webinar:

Defining high availability

Gianni Ciolli kicks off the discussion by acknowledging that everyone has their own interpretation of high availability. At its core, high availability is a component of business continuity that ensures systems stay operational during both planned and unplanned events. The core goal is to minimize downtime and data loss. Though zero data loss is ideal, organizations often make practical trade-offs between complete data preservation and cost, especially for less critical systems.

“High availability is part of business continuity, which is the ability of a system to keep working through incidents that are unplanned events or through planned events like maintenance. And in order to have high availability, you need to prepare it in advance, because if you don't prepare the day you have an incident, your system stops working and so you don't have the high availability.”

– Gianni Ciolli, EDB’s Global VP Practice Lead 

Preparation is the key

Achieving high availability requires careful planning and comprehensive monitoring. This starts with identifying potential scenarios like hardware failures, user errors, application bugs, and routine maintenance needs.

Two critical metrics need to be established for each scenario: 

  • Recovery Time Objective (RTO), which defines how quickly systems must be restored
  • Recovery Point Objective (RPO), which specifies acceptable data loss limits

These metrics vary by business requirements. Banks, for instance, require zero data loss to protect financial transactions, while other applications might tolerate some data loss to reduce system costs. Ultimately, it’s up to the business owners, not the technologists, to decide which risks are acceptable.

Monitoring database metrics (system health, hardware status, connectivity) along with application performance (connection stability, query speed, regional access) is critical. A comprehensive view is essential since problems can occur outside the database itself.
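
As one concrete illustration, a quick health check of physical streaming replication can be run on the primary. The query below is a minimal sketch assuming PostgreSQL 10 or later with streaming replicas; a real monitoring setup would feed results like these into dashboards and alerting rather than ad hoc queries.

    -- Minimal replication health check, run on the primary.
    -- Assumes PostgreSQL 10+ with physical streaming replication.
    SELECT application_name,
           client_addr,
           state,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
           replay_lag
    FROM pg_stat_replication;   -- one row per connected standby; no rows means no replicas are attached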

Enhance availability by testing

After defining your requirements and acceptable risks and identifying the most likely failure scenarios, design a response and recovery procedure for each one. Testing these procedures regularly is important because external factors can compromise recovery plans: unavailable software packages, third-party bugs, or changed network settings can all derail recovery procedures that previously worked fine.

While regular testing doesn't guarantee 100% perfection, it significantly increases the chances that your recovery procedures will work when you need them.
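
One simple form of drill, sketched below, clones the current primary onto a spare port and confirms the copy starts and answers queries. The hostname, replication user, and paths are placeholders, and a real drill would more often restore from your backup repository using whatever tooling you rely on (pg_basebackup, Barman, pgBackRest, and so on); treat this as an illustrative outline rather than a recipe.

    # Hypothetical recovery drill: clone the primary into a scratch directory,
    # start it on a spare port, and run a smoke-test query.
    pg_basebackup -h primary.example.com -U replicator -D /tmp/restore_drill -X stream -P
    pg_ctl -D /tmp/restore_drill -o "-p 5433" -l /tmp/restore_drill.log start
    psql -p 5433 -c "SELECT now(), count(*) FROM pg_stat_activity;"
    pg_ctl -D /tmp/restore_drill stop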

Managing planned downtime, upgrades and maintenance

Effective system design minimizes or eliminates planned downtime by allowing maintenance on individual components while keeping the overall system operational. This enables upgrades and maintenance to be performed on a subset of nodes or locations first, verifying their functionality and then gradually updating the remaining components.

This rolling upgrade approach applies to operating systems, PostgreSQL versions, and even database schema changes. While schema modifications may require table locks that temporarily limit access, distributed PostgreSQL systems can maintain overall availability by implementing changes one node at a time.
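
On each individual node, careful DDL habits keep those locks short. The statements below are a generic, single-node sketch against a hypothetical orders table, not a recipe for any particular distributed product: a lock_timeout makes the change fail fast instead of queuing behind running traffic, and CREATE INDEX CONCURRENTLY builds an index without blocking writes.

    -- Generic single-node sketch; "orders" is a hypothetical table.
    SET lock_timeout = '2s';                  -- give up quickly rather than queue behind running queries
    ALTER TABLE orders ADD COLUMN note text;  -- brief lock; no table rewrite for a plain nullable column
    CREATE INDEX CONCURRENTLY orders_note_idx
        ON orders (note);                     -- builds without blocking writes; cannot run inside a transaction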

What about unplanned downtime?

Formula One racing provides an analogy for planned and unplanned downtime. In Formula One, failure isn’t an option, because failure means you won’t win.

  • Planned downtime is like a scheduled pit stop to change tires. Tire wear is monitored through telemetry and stops are strategically planned so tires can be changed in mere seconds.
  • Unplanned downtime is like a sudden tire puncture during the race, where the driver needs to react immediately to prevent a crash and limp to the pit stop to change the tire. Two or three valuable minutes may be lost during this process.

Just like racing, if your database event is planned, you can prepare for the change and address it with minimal downtime. If it's unplanned, you have to react to the incident without preparation, so there’s more at risk and potentially a longer recovery process.

Is disaster recovery the same as high availability?

One of the jobs of a highly available system is to keep disaster recovery incidents as rare as possible. High availability systems aim to prevent disasters through redundancy and fault tolerance; disaster recovery comes into play when those measures fail due to unforeseen circumstances.

Even a highly available system featuring multiple copies of data across multiple regions requires backups. Backups protect against scenarios like data corruption that replicates across copies, human errors like accidental table drops, or the simultaneous failure of multiple data centers. 

Effective disaster recovery requires not just making backups, but regularly testing the complete recovery process to ensure business continuity when all other safeguards fail.
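
In PostgreSQL, a backup that can support point-in-time recovery has two ingredients: continuous WAL archiving plus periodic base backups. The snippet below is only an illustrative sketch; the archive path is a placeholder, and most teams use a dedicated backup tool such as pg_basebackup, Barman, or pgBackRest rather than a bare cp.

    # postgresql.conf on the primary: archive WAL segments as they are completed
    wal_level = replica
    archive_mode = on
    archive_command = 'cp %p /archive/%f'   # placeholder; point this at durable, off-site storage
    # Pair this with periodic base backups, for example: pg_basebackup -D /backups/base -X stream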

To err is human – good thing there’s a fix for it

Errors happen, whether it’s a human error or a software bug (arguably a type of human error). When someone accidentally deletes or modifies data through legitimate database commands, these changes instantly replicate across all copies. That’s because databases are remarkably efficient – they do what they are told, and unfortunately, they can't distinguish between intended changes and mistakes.

PostgreSQL's point-in-time recovery capability offers protection against such errors by allowing restoration to a specific moment before the mistake occurred. If a mistake is made at 5:00, for instance, you can restore the database to 4:59 to undo it. Perhaps you can’t turn back time like Superman, but you can at least take the database back to a point in time.

While there’s no way to prevent all human errors, you can minimize their impact by maintaining backups – ideally in a separate location like the cloud. This provides a crucial last line of defense when other high availability measures can't help.
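
For reference, this is roughly what that rewind looks like in PostgreSQL 12 and later (older releases used a separate recovery.conf file). The archive path and target timestamp are placeholders, and the restored base backup plus archived WAL must cover the moment you want to return to.

    # postgresql.conf on a restored base backup (PostgreSQL 12+ style)
    restore_command = 'cp /archive/%f %p'          # placeholder archive location
    recovery_target_time = '2024-11-12 04:59:00'   # stop replaying just before the 5:00 mistake
    recovery_target_action = 'promote'
    # Then create an empty recovery.signal file in the data directory and start the server.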

Additional fixes for human error

Postgres Distributed also offers delayed replicas that intentionally lag behind the primary database by a set time period. If an error is detected within that window (say, within 30 minutes on a one-hour delayed replica), you can recover data from the lagging node.
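
In vanilla PostgreSQL, the same idea can be approximated with a physical standby that deliberately applies changes behind the primary; Postgres Distributed implements its delayed nodes with its own machinery, so treat this one-line setting as an illustrative sketch only.

    # postgresql.conf on a standby: hold applied changes one hour behind the primary
    recovery_min_apply_delay = '1h'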

However, the best protection against human error is automation. When the database system automatically handles node failures, recovery, and cluster reintegration, it eliminates opportunities for operator mistakes. While management systems can help, having these protective features built directly into the database software, like with EDB Postgres Distributed, provides the most reliable defense against human error.

More high availability best practices

Upgrading - Gianni points out that many Formula One races today are held on the same tracks as they were 30 years ago, yet the cars now achieve better lap times with half the fuel consumption. In Formula One, upgrades improve performance, reliability, and safety, and that’s true of PostgreSQL too.

While database improvements may not be as easily quantifiable as Formula One’s fuel efficiency gains, the technological advances make upgrading the safest choice for protecting your data and maintaining high availability and performance.

Continual testing - Petr Jelinek mentions that one of the most important practices is repeated testing. Just as offices conduct fire drills to ensure fire alarms work, regular system drills can help you pinpoint potential problems and address them preemptively. By monitoring system performance, you can make the modifications needed to ensure long-term high availability. 

Biggest wins with EDB Postgres AI

Instead of winning a single race or resolving a specific disaster, Gianni says that one of the biggest achievements is that EDB Postgres Distributed has become a standard payment platform for moving millions of dollars daily. The fact that payments aren’t lost or accidentally duplicated speaks to the reliability and robustness of EDB Postgres.

Petr agrees the biggest win is the fact that EDB’s customers are using our technology without any major issues. While reliability may be boring, it’s exciting when your system is highly available and your business is operating as planned, every single day.

To learn more about high availability best practices, watch the full webinar.

 
