Unicode Collation Algorithm v16
The Unicode Collation Algorithm (UCA) is a specification (Unicode Technical Report #10) that defines a customizable method of collating and comparing Unicode data. Collation means how data is sorted, as with a SELECT … ORDER BY
clause. Comparison is relevant for searches that use ranges with less than, greater than, or equal to operators.
Benefits
Customizability is an important factor for various reasons:
- Unicode supports many languages. Letters that might be common to several languages might collate in different orders depending on the language.
- Characters that appear with letters in certain languages, such as accents or umlauts, have an impact on the expected collation depending on the language.
- In some languages, combinations of several consecutive characters are treated as a single character with regard to their collation sequence.
- There might be certain preferences as to collating letters according to case. For example, the lowercase form of a letter might collate before the uppercase form of the same letter or vice versa.
- There might be preferences as to whether punctuation marks like underscore characters, hyphens, or space characters are considered in the collating sequence or whether they're ignored.
Given all of these variations with the many languages supported by Unicode, a method is needed to select the specific criteria for determining a collating sequence. This is what the Unicode Collation Algorithm defines.
Note
Another advantage for using ICU collations (the implementation of the Unicode Collation Algorithm) is for performance. Sorting tasks, including B-tree index creation, can complete in less than half the time it takes with a non-ICU collation. The exact performance gain depends on your operating system version, the language of your text data, and other factors.
Basic Unicode Collation Algorithm concepts
The official information for the Unicode Collation Algorithm is specified in Unicode Technical Report #10.
The International Components for Unicode (ICU) also provides useful information. You can find an explanation of the collation concepts on the ICU website.
The basic concept behind the Unicode Collation Algorithm is the use of multilevel comparison. This means that a number of levels are defined, which are listed as level 1 through level 5 in the following bullet points. Each level defines a type of comparison. Strings are first compared using the primary level, also called level 1.
If the order can be determined based on the primary level, then the algorithm is done. If the order can't be determined based on the primary level, then the secondary level, level 2, is applied. If the order can be determined based on the secondary level, then the algorithm is done, otherwise the tertiary level is applied, and so on. There is typically a final, tie-breaking level to determine the order if it can't be resolved by the prior levels.
- Level 1 – Primary level for base characters — The order of basic characters such as letters and digits determines the difference, such as
A < B
. - Level 2 – Secondary level for accents — If there are no primary level differences, then the presence or absence of accents and other such characters determines the order, such as
a < á
. - Level 3 – Tertiary level for case — If there are no primary level or secondary level differences, then a difference in case determines the order, such as
a < A
. - Level 4 – Quaternary level for punctuation — If there are no primary, secondary, or tertiary level differences, then the presence or absence of white-space characters, control characters, and punctuation determine the order, such as
-A < A
. - Level 5 – Identical level for tie breaking — If there are no primary, secondary, tertiary, or quaternary level differences, then some other difference such as the code point values determines the order.
International components for Unicode
The Unicode Collation Algorithm is implemented by open source software provided by the International Components for Unicode (ICU). The software is a set of C/C++ and Java libraries.
When you use EDB Postgres Advanced Server to create a collation that invokes the ICU components to produce the collation, the result is referred to as an ICU collation.
Locale collations
When creating a collation for a locale, a predefined ICU short form name for the given locale is typically provided. An ICU short form is a method of specifying collation attributes, which are the properties of a collation. See Collation attributes for more information.
There are predefined ICU short forms for locales. The ICU short form for a locale incorporates the collation attribute settings typically used for the given locale. The short form simplifies the collation creation process by eliminating the need to specify the entire list of collation attributes for that locale.
The system catalog pg_catalog.pg_icu_collate_names
contains a list of the names of the ICU short forms for locales. The ICU short form name is listed in column icu_short_form
.
If needed, you can override the default characteristics of an ICU short form for a given locale by specifying the collation attributes to override that property.
Collation attributes
When creating an ICU collation, you must specify the desired characteristics of the collation. As discussed in Locale collations, you can typically do this with an ICU short form for the desired locale. However, if you need more specific information, you can specify the collation properties using collation attributes.
Collation attributes define the rules of how to compare characters for determining the collation sequence of text strings. As Unicode covers many languages in numerous variations according to country, territory, and culture, these collation attributes are complex.
For the complete and precise meaning and usage of collation attributes, see “Collator Naming Scheme” on the ICU – International Components for Unicode website.
Each collation attribute is represented by an uppercase letter. The possible valid values for each attribute are given by codes shown in the parentheses. Some codes have general meanings for all attributes. X means to set the attribute off. O means to set the attribute on. D means to set the attribute to its default value.
- A – Alternate (N, S, D) — Handles treatment of variable characters such as white spaces, punctuation marks, and symbols. When set to non-ignorable (N), differences in variable characters are treated with the same importance as differences in letters. When set to shifted (S), then differences in variable characters are of minor importance (that is, the variable character is ignored when comparing base characters).
- C – Case first (X, L, U, D) — Controls whether a lowercase letter sorts before the same uppercase letter (L), or the uppercase letter sorts before the same lowercase letter (U). You typically specify off (X) when you want lowercase first (L).
- E – Case level (X, O, D). Set in combination with the Strength attribute, use the Case Level attribute when you want to ignore accents but not case.
- F – French collation (X, O, D) — When set to on, secondary differences (presence of accents) are sorted from the back of the string as done in the French Canadian locale.
- H – Hiragana quaternary (X, O, D) — Introduces an additional level to distinguish between the Hiragana and Katakana characters for compatibility with the JIS X 4061 collation of Japanese character strings.
- N – Normalization checking (X, O, D) — Controls whether text is thoroughly normalized for comparison. Normalization deals with the issue of canonical equivalence of text whereby different code point sequences represent the same character. This occurrence then presents issues when sorting or comparing such characters — For languages such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese, set normalization checking to on.
- S – Strength (1, 2, 3, 4, I, D) — Maximum collation level used for comparison. Influences whether accents or case are taken into account when collating or comparing strings. Each number represents a level. A setting of I represents identical strength (that is, level 5).
- T – Variable top (hexadecimal digits) — Applies only when the Alternate attribute isn't set to non-ignorable (N). The hexadecimal digits specify the highest character sequence to consider ignorable. For example, if white space is ignorable but visible variable characters aren't, then set Variable Top to 0020 along with the Alternate attribute set to S and the Strength attribute set to 3. (The space character is hexadecimal 0020. Other nonvisible variable characters, such as backspace, tab, line feed, and carriage return, have values less than 0020. All visible punctuation marks have values greater than 0020.)
A set of collation attributes and their values is represented by a text string consisting of the collation attribute letter concatenated with the desired attribute value. Each attribute/value pair is joined to the next pair with an underscore character:
AN_CX_EX_FX_HX_NO_S3
You can specify collation attributes with a locale’s ICU short form name to override those default attribute settings of the locale.
In this example, the ICU short form named LROOT
is modified with a number of other collation attribute/value pairs:
AN_CX_EX_LROOT_NO_S3
The Alternate attribute (A
) is set to non-ignorable (N
). The Case First attribute (C
) is set to off (X
). The Case Level attribute (E
) is set to off (X
). The Normalization attribute (N
) is set to on (O
). The Strength attribute (S
) is set to the tertiary level 3
. LROOT
is the ICU short form to which these other attributes are applying modifications.
Using a collation
You can use a newly defined ICU collation anywhere you can use the COLLATION "collation_name"
clause in a SQL command. For example, you can use it in the column specifications of the CREATE TABLE
command or appended to an expression in the ORDER BY
clause of a SELECT
command.
The following are some examples of creating and using ICU collations based on the English language in the United States (en_US.UTF8
). In these examples, ICU collations are created with the following characteristics:
Collation
icu_collate_lowercase
forces the lowercase form of a letter to sort before its uppercase counterpart (CL
).Collation
icu_collate_uppercase
forces the uppercase form of a letter to sort before its lowercase counterpart (CU
).Collation
icu_collate_ignore_punct
causes variable characters (white space and punctuation marks) to be ignored during sorting (AS
).Collation
icu_collate_ignore_white_sp
causes white space and other nonvisible variable characters to be ignored during sorting, but visible variable characters (punctuation marks) aren't ignored (AS
,T0020
).
Note
When creating collations, ICU might generate notice and warning messages when attributes are given to modify the LROOT
collation.
The following psql
command lists the collations:
The following table is created and populated:
Using the default collation
The following query sorts on column c2
using the default collation. Variable characters (white space and punctuation marks) with id
column values of 9
, 10
, and 11
are ignored and sort with the letter B
.
Using collation icu_collate_lowercase
The following query sorts on column c2
using collation icu_collate_lowercase
. This collation forces the lowercase form of a letter to sort before the uppercase form of the same base letter. The AN
attribute forces the sort order to include variable characters at the same level when comparing base characters. Thus rows with id
values of 9
, 10
, and 11
appear at the beginning of the sort list before all letters and numbers.
Using collation icu_collate_uppercase
The following query sorts on column c2
using collation icu_collate_uppercase
. This collations forces the uppercase form of a letter to sort before the lowercase form of the same base letter.
Using collation icu_collate_ignore_punct
The following query sorts on column c2
using collation icu_collate_ignore_punct
. This collation causes variable characters to be ignored so rows with id values of 9
, 10
, and 11
sort with the letter B
, the character immediately following the ignored variable character.
Using collation icu_collate_ignore_white_sp
The following query sorts on column c2
using collation icu_collate_ignore_white_sp
. The AS
and T0020
attributes of the collation cause variable characters with code points less than or equal to hexadecimal 0020
to be ignored while variable characters with code points greater than hexadecimal 0020
are included in the sort.
The row with id
value of 11
, which starts with a space character (hexadecimal 0020
), sorts with the letter B
. The rows with id
values of 9
and 10
, which start with visible punctuation marks greater than hexadecimal 0020
, appear at the beginning of the sort list. These particular variable characters are included in the sort order at the same level when comparing base characters.