Unicode equivalence
Unicode is a complicated beast. One of its numerous peculiar features is that different sequences of codepoints can be equal. This is not the case in legacy encodings. In LATIN1, for instance, the only thing that is equal to ‘a’ is ‘a’ and the only thing that is equal to ‘ä’ is ‘ä’. In Unicode, however, characters with diacritic marks can often (depending on the particular character) be encoded in different ways: either as a precomposed character, as was done in legacy encodings such as LATIN1, or decomposed, consisting of the base character (‘a’ in this example) followed by the combining diacritic mark ◌̈. This is called canonical equivalence. The advantage of having both options is that, on the one hand, you can easily convert characters from legacy encodings and, on the other hand, you don’t need to add every accent combination to Unicode as a separate character. But this scheme makes things more difficult for software using Unicode.
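To make this concrete, here is a small sketch of the two encodings in SQL, written with Unicode escape strings (the U& syntax assumes the default standard_conforming_strings setting and a database encoding such as UTF8 that can store these characters):
SELECT U&'\00E4';        -- precomposed: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
SELECT U&'\0061\0308';   -- decomposed: 'a' followed by U+0308 COMBINING DIAERESIS
Both of these display as ‘ä’, but they are different codepoint sequences.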
As long as you are just looking at the resulting character, such as in a browser, you shouldn’t notice a difference and this doesn’t matter to you. However, in a database system where searching and sorting strings is fundamental and performance-critical functionality, things can get complicated.
First, the collation library in use needs to be aware of this. However, most system C libraries, including glibc, are not. So in glibc, when you look for ‘ä’, you won’t find ‘ä’. See what I did there? The second one is encoded differently but probably looks the same to you as you read this. (At least that is how I entered it. It might have been changed somewhere along the way to your browser.) Confusing. If you use ICU for collations, then this works and is fully supported.
Second, when PostgreSQL compares strings for equality, it just compares the bytes; it does not take into account that the same string can be represented in different ways. This is technically wrong when using Unicode, but it’s a necessary performance optimization. To work around that, you can use nondeterministic collations, a feature introduced in PostgreSQL 12. A collation declared that way will not just compare the bytes but will do any necessary preprocessing to be able to compare or hash strings that might be encoded in different ways. Example:
CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
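To illustrate the effect, here is a sketch reusing the two encodings of ‘ä’ from above and the ndcoll collation just created (assuming a UTF8 database):
SELECT U&'\00E4' = U&'\0061\0308' COLLATE ndcoll;   -- true: equivalent sequences compare equal
SELECT U&'\00E4' = U&'\0061\0308' COLLATE "C";      -- false: plain bytewise comparison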
Normalization forms
So while there are different valid ways to encode certain Unicode characters, it is sometimes useful to convert them all to a consistent form. This is called normalization. There are two normalization forms: fully composed, meaning we convert all codepoint sequences to precomposed characters as much as possible, and fully decomposed, meaning we convert all codepoints to their component pieces (letter plus accent) as much as possible. In Unicode terminology, these forms are known as NFC and NFD, respectively. There are some more details to this, such as putting all the combining characters into a canonical order, but that’s the general idea. The point is, when you convert a Unicode string into one of the normalization forms, then you can compare or hash them bytewise without having to worry about encoding variants. Which one you use doesn’t matter, as long as the whole system agrees on one.
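As a small illustration of why this matters for bytewise comparison or hashing (assuming a UTF8 server encoding):
SELECT octet_length(U&'\00E4');        -- 2: the precomposed form takes two bytes in UTF-8
SELECT octet_length(U&'\0061\0308');   -- 3: the decomposed form takes three bytes
The two forms represent the same character but are different byte sequences, so they hash and compare differently unless both sides are normalized to the same form.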
In practice, most of the world uses NFC. And moreover, many systems are faulty in that they don’t handle non-NFC Unicode correctly, including most C libraries’ collation facilities, and even PostgreSQL by default, as mentioned above. So ensuring that all Unicode is converted to NFC is a good way to ensure better interoperability.
Normalization in PostgreSQL
PostgreSQL 13 now contains two new facilities to deal with Unicode normalization: a function to test for normalization, and one to convert to a normalization form. For example:
SELECT 'foo' IS NFC NORMALIZED;
SELECT 'foo' IS NFD NORMALIZED;
SELECT 'foo' IS NORMALIZED; -- NFC is the default
SELECT NORMALIZE('foo', NFC);
SELECT NORMALIZE('foo', NFD);
SELECT NORMALIZE('foo'); -- NFC is the default
(The syntax is specified in the SQL standard.)
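Here is a sketch of how normalization makes bytewise comparison of the two encodings of ‘ä’ work as expected:
SELECT U&'\0061\0308' = U&'\00E4';                                    -- false: different byte sequences
SELECT normalize(U&'\0061\0308', NFC) = normalize(U&'\00E4', NFC);    -- true: identical after NFC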
One option is to use this in a domain, for example:
CREATE DOMAIN norm_text AS text CHECK (VALUE IS NORMALIZED);
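A hypothetical usage of this domain (the table name is made up for illustration):
CREATE TABLE notes (body norm_text);
INSERT INTO notes VALUES (normalize(U&'\0061\0308'));   -- accepted: normalized to NFC on the way in
INSERT INTO notes VALUES (U&'\0061\0308');              -- rejected: fails the IS NORMALIZED check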
Note that normalization of arbitrary text is not entirely cheap. So apply this sensibly and only where it really matters.
Note also that normalization is not closed under concatenation: appending two normalized strings does not always result in a normalized string. So even if you carefully apply these functions and also otherwise check that your system only uses normalized strings, non-normalized strings can still "creep in" during legitimate operations. Simply assuming that non-normalized strings cannot occur will fail; this issue has to be dealt with properly.
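For example (a small sketch; the second string consists of a single combining acute accent):
SELECT U&'\0065' IS NFC NORMALIZED;                   -- true: 'e' by itself
SELECT U&'\0301' IS NFC NORMALIZED;                   -- true: a lone combining accent
SELECT (U&'\0065' || U&'\0301') IS NFC NORMALIZED;    -- false: NFC would compose this to 'é' (U+00E9)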
Compatibility characters
There is another use case for normalization. Unicode contains some alternative forms of letters and other characters, for various legacy and compatibility purposes. For example, you can write Fraktur:
SELECT '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢';
Now imagine your application assigns user names or other such identifiers, and there is a user named 'somename' and another one named '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢'. This would at least be confusing, but possibly a security risk. Exploiting such similarities is often used in phishing attacks, fake URLs, and similar concerns. So Unicode contains two additional normalization forms that resolve these similarities and convert such alternative forms into a canonical base letter. These forms are called NFKC and NFKD. They are otherwise the same as NFC and NFD, respectively. For example:
=> select normalize('𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢', nfkc);
normalize
-----------
somename
Again, a check constraint, perhaps as part of a domain, can be useful:
CREATE DOMAIN username AS text CHECK (VALUE IS NFKC NORMALIZED OR VALUE IS NFKD NORMALIZED);
(The actual normalization should probably be done in the user interface frontend.)
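To illustrate with the Fraktur name from above (a hypothetical sketch):
SELECT '𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢'::username;                    -- rejected: the string is in neither NFKC nor NFKD form
SELECT normalize('𝔰𝔬𝔪𝔢𝔫𝔞𝔪𝔢', NFKC)::username;   -- accepted: yields 'somename'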
See also RFC 3454 ("stringprep") for a treatment of preparing strings to address such concerns.
Summary
Unicode equivalence issues are often ignored without consequence. In many contexts, most data is in NFC form, so no issues arise. However, ignoring these issues can lead to strange behavior, apparently missing data, and in some situations security risks. So awareness of these issues is important for database designers, and the tools described in this article can be used to deal with them.