Explaining ABI Breakage in PostgreSQL 17.1

Pavan Deolasee

December 06, 2024

PostgreSQL comes out with a scheduled major release every year and scheduled minor releases for all supported versions every quarter. But in the November minor releases, two issues caused the PostgreSQL project to announce [1] the first out-of-cycle release in at least 5 years. In this post we'll share how EDB discovered and raised the potential ABI breakage issue.

What went wrong this time?

Almost immediately after 17.1 (and the other minor versions) was released, a couple of significant issues were discovered. Etienne Lafarge reported [2] that the new minor release broke ALTER USER .. SET ROLE in such a way that the specified role doesn’t take effect when the user logs in.

Secondly, we at EDB found that the release potentially broke the Application Binary Interface (ABI). We posted [3] this to the mailing lists as soon as we discovered it.

This article focuses on the ABI breakage issue and why it’s so important for PostgreSQL to maintain a stable ABI for its growing extension ecosystem. So let’s try to understand what an ABI is. The Wikipedia article [4] on Application Binary Interface defines it as:

An application binary interface (ABI) is an interface between two binary program modules. Often, one of these modules is a library or operating system facility, and the other is a program that is being run by a user.

Let’s see this via an example. We are creating a simple shared library which provides a routine to print marks obtained by students in various subjects.

shlib.h


typedef struct Marks
{
	/* student roll number */
	int		rollno;
	/* marks obtained in various subjects */
	int		phy;
	int		chem;
} Marks;

extern void print_marks(Marks m[], int count);

shlib.c

#include <stdlib.h>
#include <stdio.h>
#include "shlib.h"


void
print_marks(Marks m[], int count)
{
	for (int i = 0; i < count; i++)
		printf("Student %d scored phy=%d, chem=%d\n",
				m[i].rollno,
				m[i].phy,
				m[i].chem);
}

Let’s package this as a shared library. Any user can then make use of the exported structure and function to use the facility offered by the library.

$ gcc -c -Wall -Werror -fpic shlib.c
$ gcc -shared -o libshlib.so shlib.o

This produces a shared library named libshlib.so that another program can make use of. For example, we have a program which collects the students’ marks and uses the library to print the information.

print.c

#include "shlib.h"
int
main(void)
{
	Marks m[] = {
		{1, 99, 95},
		{2, 98, 100}
	};
	print_marks(m, 2);
}

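Let's compile this into an executable program and then run it. (On Linux, the dynamic loader may need LD_LIBRARY_PATH=. set in order to find libshlib.so in the current directory.)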

$ gcc print.c -L. -lshlib -o program 
$ ./program 
Student 1 scored phy=99, chem=95
Student 2 scored phy=98, chem=100

Let’s now change the shared library so that it supports an additional subject and also tracks if the student had written the exam or not.

shlib.h

typedef struct Marks
{
	/* student roll number */
	int		rollno;
	/* marks obtained in various subjects */
	int		phy;
	int		chem;
	int		math;
	/* did student write the exam? */
	char	absent;
} Marks;
extern void print_marks(Marks m[], int count);

shlib.c

#include <stdlib.h>
#include <stdio.h>
#include "shlib.h"
void
print_marks(Marks m[], int count)
{
	for (int i = 0; i < count; i++)
		printf("Student %d scored phy=%d, chem=%d, math=%d\n",
				m[i].rollno,
				m[i].phy,
				m[i].chem,
				m[i].math);
}

$ gcc -c -Wall -Werror -fpic shlib.c
$ gcc -shared -o libshlib.so shlib.o

The shared library is now ready to support the new structure. But the program that was relying on the ABI to remain unchanged was not rebuilt. If we execute the program again, it will print garbage or may even crash.

$ ./program 
Student 1 scored phy=99, chem=95, math=2
Student 100 scored phy=1283063832, chem=-1546250667, math=1800121120

This is because the program still thinks that the struct contains only 3 integers (total size 12 bytes), but the shared library’s view has changed. It expects the struct to contain 4 integers and a character, so total 20 bytes, including padding.

Similarly, suppose the prior version of the shared library declared an external function that takes 2 arguments. When the user program was compiled against it, the compiler generated code that passes exactly 2 arguments, in registers or on the stack as the platform’s calling convention dictates. If the shared library is later changed so that the same function takes 3 arguments, the agreement between the user program’s binary code and the shared library’s binary code is broken: the function reads a third argument that the caller never supplied. The result is unpredictable.

In the context of PostgreSQL, when the packages are built, various binaries and libraries are produced and made available as part of the distribution. Along with those binaries and libraries, a set of header files is included too. These header files describe the global variables, structures, enums, function declarations and many other things that extension writers typically rely on while writing extensions on top of Postgres. So when extensions are compiled into binaries or libraries, there is a certain agreement between them and the Postgres binaries. When that agreement breaks, the result can be quite unpredictable.

A problem occurs when Postgres has earlier declared that a certain struct is of a certain size, but then breaks that guarantee and increases the size of that struct. If such a struct is allocated by an extension and passed down to Postgres, the two separately built binaries disagree about its layout. That’s exactly what happened when PostgreSQL 17.1 was released on Nov 14th: Postgres added a new member to the end of the ResultRelInfo struct, thus increasing the overall size of the struct.

    /* updates do LockTuple() before oldtup read; see README.tuplock */
    bool        ri_needLockTagTuple;
} ResultRelInfo;

How did we at EDB find it?

If we go back to our previous example, you would realise that the change in the struct’s size can break ABI and cause problems. We must understand that the problem occurs when the extension has been compiled against an old release of Postgres, but then used with the new Postgres minor version. If the extension is recompiled using the new header files, then things are quite okay. That’s why built-in extensions that are released along with Postgres are usually unaffected by such changes.

EDB has a large portfolio of extensions. EDB Postgres Distributed (PGD) is one of the biggest. It’s an extension that gets loaded into Postgres via shared_preload_libraries. PGD supports a large number of Postgres versions and flavours and comes out with periodic minor and major releases, often in sync with Postgres’s releases.

One of the major features of PGD is the ability to perform zero downtime upgrades of Postgres’s minor and major releases. This is often performed using a technique called rolling upgrades where one node in the cluster is upgraded at a time. It takes quite some effort to ensure that the rolling upgrades work correctly when new Postgres and PGD releases come out. So as part of our CI process we run upgrade tests on a cluster with some nodes still running the older version while other nodes are upgraded to the latest available release.

One such test started failing when the new Postgres packages were released. We started by looking at the ABI compatibility between the old and new versions of Postgres and the change to the ResultRelInfo struct caught our eye immediately.

$ git diff REL_17_0 src/include/nodes/execnodes.h         
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd1b16296b5..418c81f4be2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -592,6 +592,9 @@ typedef struct ResultRelInfo
         * one of its ancestors; see ExecCrossPartitionUpdateForeignKey().
         */
        List       *ri_ancestorResultRels;
+
+       /* updates do LockTuple() before oldtup read; see README.tuplock */
+       bool            ri_needLockTagTuple;
 } ResultRelInfo;
 
 /* ----------------

We knew that we use this struct quite liberally, so there was a real possibility that the change could break something.

What did EDB and the rest of the Postgres community do?

Once it was clear that the change could cause ABI breakage, not just for us but for other Postgres extensions too, we decided to report [3] the potential problem to the community. We were cognizant of the fact that the releases were already out, but it was important to raise the alarm as soon as we could.

Once the community became aware of the potential risk, the issue was discussed at length. For a moment, it seemed we could get away with this, because Postgres’s built-in memory management system allocates memory in power-of-2 sizes and hence was already overallocating for this struct even in older Postgres versions. But the fact that an array of this struct is exposed by the executor to extensions made the status quo a bit dangerous. Finally, the community decided to make an out-of-schedule release to address the problem.

While the PostgreSQL project was deliberating the possible actions to mitigate the risk, the larger extension ecosystem was also trying to assess the damage and prepare a plan to ensure business continuity for their customers. This included rebuilding the extensions on top of the new packages and/or notifying their customers to not upgrade to the latest PostgreSQL versions. Of course, for some users, none of these might be acceptable workarounds (especially because the new versions also have some critical security fixes), which brings us back to the point that the PostgreSQL project must remain vigilant and cautious about ABI breakages in minor releases.

What can we do to mitigate such risks in the future?

While PostgreSQL committers are always careful to make sure that a minor release does not introduce ABI breakage, accidents can still happen. The community has revised its guidelines for the committers regarding potential changes that can cause ABI breakage. For example, earlier it was deemed reasonably safe to add new members at the end of a struct, but in the light of this incident, extra caution is warranted even in those cases. Instead, we should always try to accommodate new members within any existing padding space. If such space does not exist, then extra caution should be taken.

The community is also looking at using extra tools for catching potential ABI breakages. There are already a couple of proposals [5] in place for this and it's probably time to expedite those projects.

Extension writers also need to play their part in ensuring that the ecosystem works flawlessly. They should follow general guidelines such as:

Only use exported functions, variables, structs, etc. while writing extension code.
Avoid making copies of internal functions or structs and assuming that they will remain stable across minor versions.
Ensure that a struct allocated on the stack or heap is always initialised properly.

Another important aspect is to perform continuous integration testing so that any potential breakages are detected in advance and not after the releases are out. In this case, the commit that caused the ABI breakage was checked in nearly 6 weeks before the release went out. That should be more than enough time to catch such issues if extension writers are continuously testing their software against the new code.

References

[1] PostgreSQL: Out-of-cycle release scheduled for November 21, 2024
[2] PostgreSQL: Today's Postgres Releases break login roles
[3] PostgreSQL: Potential ABI breakage in upcoming minor releases
[4] https://en.wikipedia.org/wiki/Application_binary_interface
[5] PostgreSQL: abi-compliance-checker
