Postgres-BDR: It is also about fast safe upgrades

October 15, 2019
Postgres-BDR: It is also about fast safe upgrades

 

In the past year, I have had conversations with many clients and prospects about Postgres-BDR. Postgres-BDR has been in production for many mission critical deployments for quite some time. We have more clients coming on board each month.

Recently, I have been meeting with companies whose business is in either Telecommunications. In such industries, down time is measured in money. The systems need to be available 24×7. In other words the systems need to be “Always-On”.

Postgres is an incredibly stable, feature rich database with a permissive license. For these reasons, and many others, it is widely adopted.

Despite the wide adoption of Postgres, there are availability challenges. These challenges are usually dealt with using third party products or technologies that are part of the open source Postgres ecosystem. Many 2ndQuadrant customers use repmgr (www.repmgr.org) to handle switchover and failover between streaming replicas.

Just about any automated master-standby failover solution has a 1-2 minute outage time if a master fails. Typically waiting sixty seconds to confirm the primary is down via a series of pings and subsequently promoting a standby to become a master. In the master standby base configuration, conservatively confirming the primary down is very important to avoid a false positive failover scenarios. Since a standby is going to be promoted, you want to be certain the existing master is down. If you failover when the master is not in fact down, you will have serious problems that I will describe in a separate blog article.

Since Postgres BDR is a Master – Master solution, the consequences of failing over are not nearly as high, a DBA can be much more aggressive in deciding if a primary has failed. In addition, the time it takes to promote a standby to a master is eliminated. A switchover can be completed in under a second. A failover can be completed as quickly as you comfortable with (under a second in some cases). You may need to deal conflict resolutions but this is far simpler than dealing with a split brain or the cost of a two-minute outage. I have skimmed over the details but we do in fact have customers using BDR to increase availability in this manner. So up to 1 minute and 59 seconds of downtime eliminated.

What about the down time required for upgrades? This is equally important.

Upgrades include application upgrades, Postgres upgrades, BDR  upgrades and upgrades of the operating system and the hardware.  At times when considering a high availability solution, the down time for these tasks are often removed from the availability equation as “planned outages”. However, the system is just as unavailable as it would be in a failover scenario and you are likely costing the business you support money. Perhaps at a more convenient time for the user base, but with user bases being global these days there is never a good time for an outage. In addition, upgrades can involve a critical bug fixes that can’t wait for a semi-convenient maintenance window.

Some Postgres upgrade scenarios can take over 30 minutes of down time to complete and are difficult to test. Even a minor Postgres version upgrade requires a server restart. All of these tasks require down time. The truth is you may never encounter a failover. However, you will definitely need to upgrade. So outages with respect to upgrades are very important to consider and plan for. In the case of a BDR deployment you can simply not have such outages.

An often-overlooked fact is that BDR nodes can have different versions of Postgres, different versions of BDR and in some cases different versions the application across different nodes. The BDR protocol along with logical replication will handle data synchronization between these versions for you.

You can simply introduce a new BDR node in the cluster, with a more recent version of Postgres, wait for the data to be synchronized across the nodes and when you are ready, switch the application traffic to the upgraded node. No downtime!

If that is not enough to convince you to consider Postgres-BDR for upgrades, to increase your availability how about this: When you upgrade Postgres there is rarely a way back once users start writing to the new database. Performance goes ugly, errors start happening you now have a fire drill that must be fought until the fire goes out. It can go on for days or weeks and in the worst-case scenario result in a potential long outage while you figure out how to get your users back to the old software (if you in fact can).

In the case of upgrading via introducing a BDR node if things go bad, switch the traffic back to the old node. No fires, no pain no data loss. Yes, you can test, test, test to ensure you won’t encounter issues with an upgrade. However, how much testing is enough? Most of us have seen an upgrade once put into production result in a problem we didn’t consider.

Still not convinced? Consider this, if you have multiple BDR nodes in the cluster with different versions you can have some of your users run on the new software for a while before migrating all of them. You can have a rolling upgrade across your user base in a DevOps best practice manner.

The term Blue-Green upgrades where you can rollback in the event of a failure can now apply to Postgres and not just the application tier.

There are many great uses cases that can be accomplished by Postgres BDR such a geographic data sharding, getting the data close to the user and write availability. However, for fast reliable database online upgrades and Blue-Greeen deployments, BDR is also a great (and often over looked) solution.

 

 

Share this