Unicode
Goals
- Be familiar with Unicode character categories.
- Understand character composition.
- Be aware of Java character surrogate pairs.
- Know how to normalize character sequences.
- Realize the importance of using a collator for internationalization string comparison.
Concepts
- base character
- Basic Multilingual Plane (BMP)
- Best Current Practice (BCP)
- bidi
- canonical
- character
- character composition
- collation
- collator
- collator strength
- diacritic
- glyph
- high surrogate
- International Components for Unicode (ICU)
- Internet Engineering Task Force (IETF)
- low surrogate
- mark
- normalization
- precomposed character
- render
- supplementary character
- surrogate pairs
- Universal Coded Character Set (UCS)
- World Wide Web Consortium (W3C)
Library
java.lang.Character
java.lang.Character.getName(int codePoint)
java.lang.Character.getType(int codePoint)
java.lang.Character.isAlphabetic(int codePoint)
java.lang.Character.isDigit(int codePoint)
java.lang.Character.isHighSurrogate(char ch)
java.lang.Character.isLetter(int codePoint)
java.lang.Character.isLetterOrDigit(int codePoint)
java.lang.Character.isLowSurrogate(char ch)
java.lang.Character.isLowerCase(int codePoint)
java.lang.Character.isMirrored(int codePoint)
java.lang.Character.isSpaceChar(int codePoint)
java.lang.Character.isTitleCase(int codePoint)
java.lang.Character.isUpperCase(int codePoint)
java.lang.Character.isSurrogate(char ch)
java.lang.Character.isWhitespace(int codePoint)
java.lang.Character.isValidCodePoint(int codePoint)
java.lang.Character.toCodePoint(char high, char low)
java.lang.CharSequence
java.lang.CharSequence.charAt(int index)
java.lang.CharSequence.chars()
java.lang.CharSequence.codePoints()
java.lang.CharSequence.length()
java.lang.CharSequence.subSequence(int start, int end)
java.lang.String
java.lang.String.compareTo(String anotherString)
java.lang.String.compareToIgnoreCase(String str)
java.lang.String.equals(Object anObject)
java.lang.String.equalsIgnoreCase(String anotherString)
java.lang.StringBuilder
java.lang.StringBuilder.appendCodePoint(int codePoint)
java.text.Normalizer
java.text.Normalizer.Form
java.text.Collator
java.text.Collator.CANONICAL_DECOMPOSITION
java.text.Collator.IDENTICAL
java.text.Collator.PRIMARY
java.text.Collator.SECONDARY
java.text.Collator.TERTIARY
java.text.Collator.compare(Object o1, Object o2)
java.text.Collator.compare(String source, String target)
java.text.Collator.equals(String source, String target)
java.text.Collator.getInstance()
java.text.Collator.getInstance(Locale desiredLocale)
java.text.Collator.setDecomposition(int decompositionMode)
java.text.Collator.setStrength(int newStrength)
java.util.stream.IntStream
java.util.Locale
java.util.Locale.CANADA_FRENCH
java.util.Locale.CHINA
java.util.Locale.ENGLISH
java.util.Locale.UK
java.util.Locale.US
java.util.Locale.forLanguageTag(String languageTag)
java.util.Locale.getAvailableLocales()
java.util.Locale.getDefault()
java.util.Locale.getDefault(Locale.Category category)
java.util.Locale.setDefault(Locale newLocale)
java.util.Locale.setDefault(Locale.Category category, Locale newLocale)
java.util.Locale.toLanguageTag()
java.util.Locale.toString()
Lesson
You learned in a previous lesson that a major part of internationalization is separating resources from code, placing display strings and related information into resource files to be localized for different locales. Internationalization, however, does not end with being able to substitute different strings for different languages. Processing those strings requires special awareness of variations in writing systems and writing direction. You can still run into problems, even using Unicode string resources, if you assume that the strings follow the rules of American English. Many assumptions of a native English-speaking developer do not hold up when applied to other locales, as shown by the following figure.
The Unicode Consortium has intensely studied the various writing systems and produced rules for working with Unicode code points to allow for the variations in those systems. Studying those rules first requires attention to Java's own representation of Unicode characters.
Characters
You were introduced to the concepts of characters in an earlier lesson on charsets. Each Unicode code point in the Universal Coded Character Set (UCS) represents a specific character, which is a distinct semantic entity without regard to the style in which it is presented. You also learned that Unicode characters with code points in the range 0x0000
– 0xFFFF
make up the Basic Multilingual Plane (BMP). Code points beyond this range are for supplementary characters.
You cannot see
a character itself; it is only a concept. A stylistic representation of a character is called a glyph. The presentations E and E are two different glyphs representing the character named LATIN CAPITAL LETTER E
. To display a glyph representation, a computer will render the character in some font and style.
In Java a primitive char can be wrapped in a java.lang.Character
instance, but the Character
class also represents the Unicode concept of a character
, and provides access to much useful information about the Unicode character definitions. In this lesson Character
is used to indicate the Java class, while character
is used to mean a Unicode character in general.
Character Properties
The Unicode Standard provides definitions of all UCS characters in the Unicode Character Database (UCD), including extensive property lists and supplementary information, which it distributes as a series of text files accessible to computer processing. See Unicode Character Database for access to the UCD itself, and UAX #44: Unicode Character Database for a specification of UCD file formats.
Example entries from the UnicodeData.txt file distributed with the UCD.

Code Point | Glyph | Name | General Category | Canonical Combining Class | Bidi Class | Decomposition Type / Mapping | Numeric Type / Value | Bidi Mirrored | Simple Uppercase Mapping | Simple Lowercase Mapping | Simple Titlecase Mapping
---|---|---|---|---|---|---|---|---|---|---|---
U+0020 | | SPACE | Zs | 0 | WS | | | N | | |
U+0021 | ! | EXCLAMATION MARK | Po | 0 | ON | | | N | | |
U+0024 | $ | DOLLAR SIGN | Sc | 0 | ET | | | N | | |
U+0033 | 3 | DIGIT THREE | Nd | 0 | EN | | 3 | N | | |
U+0045 | E | LATIN CAPITAL LETTER E | Lu | 0 | L | | | N | | 0065 |
U+0065 | e | LATIN SMALL LETTER E | Ll | 0 | L | | | N | 0045 | | 0045
U+007B | { | LEFT CURLY BRACKET | Ps | 0 | ON | | | Y | | |
U+007D | } | RIGHT CURLY BRACKET | Pe | 0 | ON | | | Y | | |
U+00C9 | É | LATIN CAPITAL LETTER E WITH ACUTE | Lu | 0 | L | 0045 0301 | | N | | 00E9 |
U+00E9 | é | LATIN SMALL LETTER E WITH ACUTE | Ll | 0 | L | 0065 0301 | | N | 00C9 | | 00C9
U+0301 | ́ | COMBINING ACUTE ACCENT | Mn | 230 | NSM | | | N | | |
U+0628 | ب | ARABIC LETTER BEH | Lo | 0 | AL | | | N | | |
U+0663 | ٣ | ARABIC-INDIC DIGIT THREE | Nd | 0 | AN | | 3 | N | | |
U+092E | म | DEVANAGARI LETTER MA | Lo | 0 | L | | | N | | |
U+10014 | | LINEAR B SYLLABLE B080 MA | Lo | 0 | L | | | N | | |
U+1F602 | 😂 | FACE WITH TEARS OF JOY | So | 0 | ON | | | N | | |
The official UCS name of a character can be retrieved using Character.getName(int codePoint)
.
Character Categories
Code | Name | Description |
---|---|---|
Lu | Uppercase_Letter | an uppercase letter |
Ll | Lowercase_Letter | a lowercase letter |
Lt | Titlecase_Letter | a digraphic character, with first part uppercase |
LC | Cased_Letter | Lu | Ll | Lt |
Lm | Modifier_Letter | a modifier letter |
Lo | Other_Letter | other letters, including syllables and ideographs |
L | Letter | Lu | Ll | Lt | Lm | Lo |
Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) |
Mc | Spacing_Mark | a spacing combining mark (positive advance width) |
Me | Enclosing_Mark | an enclosing combining mark |
M | Mark | Mn | Mc | Me |
Nd | Decimal_Number | a decimal digit |
Nl | Letter_Number | a letterlike numeric character |
No | Other_Number | a numeric character of other type |
N | Number | Nd | Nl | No |
Pc | Connector_Punctuation | a connecting punctuation mark, like a tie |
Pd | Dash_Punctuation | a dash or hyphen punctuation mark |
Ps | Open_Punctuation | an opening punctuation mark (of a pair) |
Pe | Close_Punctuation | a closing punctuation mark (of a pair) |
Pi | Initial_Punctuation | an initial quotation mark |
Pf | Final_Punctuation | a final quotation mark |
Po | Other_Punctuation | a punctuation mark of other type |
P | Punctuation | Pc | Pd | Ps | Pe | Pi | Pf | Po |
Sm | Math_Symbol | a symbol of mathematical use |
Sc | Currency_Symbol | a currency sign |
Sk | Modifier_Symbol | a non-letterlike modifier symbol |
So | Other_Symbol | a symbol of other type |
S | Symbol | Sm | Sc | Sk | So |
Zs | Space_Separator | a space character (of various non-zero widths) |
Zl | Line_Separator | U+2028 LINE SEPARATOR only |
Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only |
Z | Separator | Zs | Zl | Zp |
Cc | Control | a C0 or C1 control code |
Cf | Format | a format control character |
Cs | Surrogate | a surrogate code point |
Co | Private_Use | a private-use character |
Cn | Unassigned | a reserved unassigned code point or a noncharacter |
C | Other | Cc | Cf | Cs | Co | Cn |
Each character is assigned a general category
, which is important for determining how characters relate to one another. A character may be, for example, a lowercase or uppercase letter, a number, or punctuation. The general category column in the table above contains codes specifying the category shown in the figure to the side. Unicode also defines category groups that are shorthand for several categories together, but these are not used in the definitions of the individual characters.
- The space character U+0020 is a Space_Separator (Zs).
- The exclamation mark ! character U+0021 is Other_Punctuation (Po); while the left curly bracket { character U+007B and right curly bracket } character U+007D are considered Open_Punctuation (Ps) and Close_Punctuation (Pe), respectively.
- The dollar sign $ character U+0024 is a Currency_Symbol (Sc).
- The digit 3 U+0033 and the Arabic-Indic digit ٣ U+0663 are considered Decimal_Number (Nd).
- The characters E U+0045 and É U+00C9 are Uppercase_Letter (Lu), while the characters e U+0065 and é U+00E9 are Lowercase_Letter (Ll).
- The letters ب U+0628, म U+092E, and U+10014 are considered Other_Letter (Lo), as Arabic, Devanagari, and Linear B respectively have no concept of uppercase or lowercase.
- The combining acute accent character U+0301 is considered a Nonspacing_Mark (Mn), the significance of which will be explained below under Composition.
Java refers to a character's general category as its type
, which can be retrieved using Character.getType(int codePoint)
. The return value is an int matching one of the Character class constants that represent the general categories in the table above. Rather than querying the type generally, the Character
class has several convenience methods for querying specific categories. The convenience methods covering the category groups are especially handy, as they encompass several types.
- Character.isAlphabetic(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), Letter_Number (Nl), or contributory property Other_Alphabetic. Equivalent to group Letter (L) with the addition of Letter_Number (Nl).
- Character.isDigit(int codePoint): Decimal_Number (Nd).
- Character.isLetter(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), or Other_Letter (Lo). Equivalent to group Letter (L).
- Character.isLetterOrDigit(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), or Decimal_Number (Nd). Equivalent to group Letter (L) with the addition of Decimal_Number (Nd).
- Character.isLowerCase(int codePoint): Lowercase_Letter (Ll) or contributory property Other_Lowercase.
- Character.isMirrored(int codePoint): mirrored characters, such as those in categories Open_Punctuation (Ps) and Close_Punctuation (Pe).
- Character.isSpaceChar(int codePoint): Space_Separator (Zs), Line_Separator (Zl), or Paragraph_Separator (Zp). Equivalent to group Separator (Z).
- Character.isTitleCase(int codePoint): Titlecase_Letter (Lt).
- Character.isUpperCase(int codePoint): Uppercase_Letter (Lu) or contributory property Other_Uppercase.
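A few of these lookups can be sketched directly against the standard library; the characters tested here are the ones from the UCD table above (the class name is illustrative):

```java
public class CategoryDemo {
    public static void main(String[] args) {
        // É U+00C9 is an Uppercase_Letter (Lu)
        assert Character.getType(0x00C9) == Character.UPPERCASE_LETTER;
        // The Arabic-Indic digit three U+0663 is a Decimal_Number (Nd)
        assert Character.isDigit(0x0663);
        // The combining acute accent U+0301 is a Nonspacing_Mark (Mn)
        assert Character.getType(0x0301) == Character.NON_SPACING_MARK;
        // Supplementary characters are queried by int code point, e.g. Linear B U+10014
        assert Character.isLetter(0x10014);
        System.out.println(Character.getName(0x00C9)); // LATIN CAPITAL LETTER E WITH ACUTE
    }
}
```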
Bidi Properties
Some scripts such as Arabic and Hebrew are written from right-to-left rather than left-to-right. Several properties, such as Bidi Class
and Bidi Mirrored
in the table above, provide extensive information on the bidirectional or bidi characteristics of each character. This information is used when rendering a string of characters to actually display them to the user, which uses a complex set of rules described in UAX #9: Unicode Bidirectional Algorithm. Storing and retrieving internationalized strings can be done in large part without regard to bidi-related issues, which may be left to the rendering code that actually prints or displays the text.
Composition
The ISO-8859-1 charset, as you saw in an earlier lesson, increased the ASCII repertoire mostly by adding many accented letters found in European languages, such as the é
U+00E9
that appears in touché
. Many languages that use these accented characters do not consider them to be different letters as such. Rather the accent or diacritic (from diacritical mark
) is added to a letter to ensure that its sound remains the same in the presence of other letters. (This is similar to how English adds a letter e
to the end of words such as cane
so that the vowel sound of can
is changed). A diacritic is one of several types of marks, as indicated by the category Mark
(M
) in the character category table above.
The UCS provides separate diacritics and other marks in the category Nonspacing_Mark
, which may be placed after a base character with the understanding that the base character and the mark(s) will be shown to the user as one character. Rather than storing a single code point U+00E9
representing é
, you could store the code point U+0065
representing the base character e
, followed by the code point U+0301
for the combining acute accent mark. Through character composition this will create what appears to be the same character: é
. The letter é
U+00E9
is called a precomposed character because its form is already composed of the base character U+0065
and the combining mark U+0301
.
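The two storage forms can be seen in Java string literals; as a minimal sketch, they are distinct char sequences even though both display as é:

```java
public class ComposedForms {
    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as the single precomposed code point U+00E9
        String decomposed = "e\u0301"; // e U+0065 followed by COMBINING ACUTE ACCENT U+0301
        // Both render as é, but as char sequences they are different lengths and not equal
        assert precomposed.length() == 1 && decomposed.length() == 2;
        assert !precomposed.equals(decomposed);
    }
}
```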
Character Sequences
Many times in Java you will need to deal with more than just single characters. As you saw above, at times what may appear as a single glyph is in reality stored as a combination of a base character followed by combining mark characters. The character sequence you are most familiar with is java.lang.String
.
CharSequence
A more general interface for dealing with sequences of characters is java.lang.CharSequence
. It provides many of the same char
access methods you may already have noticed in String
, such as CharSequence.length()
and CharSequence.charAt(int index)
. Both String
and java.lang.StringBuilder
implement the CharSequence
interface.
CharSequence
can also provide its char
values as an IntStream
via the CharSequence.chars()
method. Counting occurrences of 'E'
/ 'e'
in a string could be accomplished much more compactly (and potentially more efficiently) using stream filtering and counting.
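Such a stream-based count might be sketched as follows (the countEs helper name is illustrative, not part of any API):

```java
public class LetterECount {
    // Illustrative helper: counts occurrences of 'E' or 'e' using the chars() stream
    static long countEs(CharSequence text) {
        return text.chars()                        // IntStream of char values
                .filter(c -> c == 'E' || c == 'e') // keep only uppercase or lowercase e
                .count();
    }

    public static void main(String[] args) {
        assert countEs("Eleven geese") == 6;
    }
}
```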
Another useful method is CharSequence.subSequence(int start, int end)
, which functions similarly to collection views such as java.util.List.subList(int fromIndex, int toIndex)
. The end
index is exclusive. Thus calling "touché".subSequence(1, 5)
would yield the string "ouch"
.
Surrogate Pairs
Some UCS code points such as the Linear B symbol U+10014
in the table above are outside the BMP and require more than the 16 bits available in a single char
. When working with the Character
class, we got around this limitation by using a 32-bit int
as a parameter in methods such as Character.isValidCodePoint(int codePoint)
. CharSequence
implementations such as String
, however, are conceptually made up of char
values. To understand how Java provides access to supplementary characters, you must first understand some of the details of UTF-16 encoding.
When discussing charsets you learned that both UTF-8 and UTF-16 are variable-length encodings, and both cover the entire range of UCS code points. UTF-8 can use one, two, three, or four bytes to encode a single code point using an involved algorithm. UTF-16, on the other hand, uses a simpler approach. For any code points between 0x0000
and 0xFFFF
, UTF-16 will use a single 16-bit value to represent the character. For supplementary characters (U+10000
and above) UTF-16 will use two subsequent 16-bit values called a surrogate pair. The first value or high surrogate will be in the range 0xD800
– 0xDBFF
, while the second value or low surrogate will be in the range 0xDC00
– 0xDFFF
. This encoding technique only works because there exist no UCS characters that use code points in the range 0xD800
– 0xDFFF
.
Java uses UTF-16 in CharSequence
and its implementations. This has a startling implication: not every char
value represents a Unicode code point. Every char
in a CharSequence
such as String
is potentially part of a surrogate pair, which must be decoded to discover the UCS code point! To discover if a char
value is part of a surrogate pair, use Character.isSurrogate(char ch)
. Because surrogate pairs come in a certain order, you can use Character.isHighSurrogate(char ch)
to determine if a surrogate pair is starting, followed by Character.isLowSurrogate(char ch)
for the subsequent character of any encountered pair. After detecting a surrogate pair, you can determine the encoded Unicode code point using Character.toCodePoint(char high, char low)
.
As an example of how not to process a character sequence, the following code naively looks at each char
without regard to whether a surrogate pair is present when counting characters in the Letter
(L
) category.
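A naive loop of this kind might look like the following sketch (class and method names are illustrative):

```java
public class NaiveLetterCount {
    // Broken: treats every char as if it were a complete code point
    static int countLettersNaively(CharSequence text) {
        int letterCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) { // surrogate halves never count as letters
                letterCount++;
            }
        }
        return letterCount;
    }

    public static void main(String[] args) {
        // U+10014 LINEAR B SYLLABLE B080 MA is stored as the surrogate pair 0xD800 0xDC14
        String text = "ma: \uD800\uDC14";
        System.out.println(countLettersNaively(text)); // prints 2; the Linear B letter is missed
    }
}
```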
Unfortunately the above implementation would skip the Linear B symbol U+10014
altogether. Even though Java recognizes it as being in the Unicode category Other_Letter
(Lo
), it is encoded in the string as the UTF-16 surrogate pair 0xD800
and 0xDC14
. Neither of these char
values on its own represents a letter. To correct the algorithm, one would need to use the Character
surrogate pair detection and conversion methods explained above, as illustrated in the figure below.
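A corrected loop along those lines might be sketched as follows, joining each surrogate pair into a full code point before classifying it (names are illustrative):

```java
public class SurrogateAwareCount {
    // Joins each surrogate pair into a code point before testing its category
    static int countLetters(CharSequence text) {
        int letterCount = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int codePoint = c;
            if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                codePoint = Character.toCodePoint(c, text.charAt(i + 1));
                i++; // skip the low surrogate just consumed
            }
            if (Character.isLetter(codePoint)) {
                letterCount++;
            }
        }
        return letterCount;
    }

    public static void main(String[] args) {
        // ends with U+10014 LINEAR B SYLLABLE B080 MA, encoded as 0xD800 0xDC14
        assert countLetters("ma: \uD800\uDC14") == 3; // m, a, and the Linear B letter
    }
}
```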
Besides CharSequence.chars()
there is another method that returns a stream: CharSequence.codePoints()
. Although both methods return an IntStream
, CharSequence.codePoints()
returns a sequence of Unicode code point values instead of char
values. Unlike CharSequence.chars()
, CharSequence.codePoints()
will never return a surrogate pair, and you will never have to check for them! Just be careful not to cast the values returned by CharSequence.codePoints()
to char
.
Using CharSequence.codePoints()
the letter-counting algorithm above could be made much more compact and readable.
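For instance, a code point based version of the letter count might be sketched as (names illustrative):

```java
public class CodePointLetterCount {
    // codePoints() has already joined any surrogate pairs into full code points
    static long countLetters(CharSequence text) {
        return text.codePoints()
                .filter(Character::isLetter)
                .count();
    }

    public static void main(String[] args) {
        assert countLetters("ma: \uD800\uDC14") == 3; // now includes U+10014
    }
}
```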
Normalization
You saw above that some characters such as é
U+00E9
have both a precomposed form and a decomposed form, with accents and other marks stored as distinct code points. This can cause major problems with even simple operations such as searching for a character or comparing strings. You surely would not want to perform separate searches for U+00E9
and for the sequence U+0065
U+0301
, just to cover all composition forms of é
. Imagine searching for letters that have decomposed forms consisting of three or more characters! Checking for all possible representations would become a nightmare when comparing strings.
Code | Name | Description |
---|---|---|
NFD | Normalization Form D | Canonical Decomposition |
NFC | Normalization Form C | Canonical Decomposition , followed by Canonical Composition |
NFKD | Normalization Form KD | Compatibility Decomposition |
NFKC | Normalization Form KC | Compatibility Decomposition , followed by Canonical Composition |
Unicode's solution to the problem of precomposed characters is normalization, the process of converting characters to some normal form. Here normal
simply means some common, agreed upon representation; and UAX #15: Unicode Normalization Forms defines four such normalization forms, displayed in the accompanying figure. Normalization occurs by some sequence of decomposing the precomposed characters and optionally composing them into precomposed forms (whether they started that way or not). The most important of these forms use canonical decomposition/composition, using the forms preferred by The Unicode Standard.
Two strings normalized to the same normalization form can be safely compared, because the composition of each character is guaranteed to be the same by the algorithm. Which form you choose depends on your needs. If you wanted simply to compare two strings, you could use form NFD
and decompose all the characters to their canonical decomposed forms, with no need to put them back into precomposed forms. If you were searching for a character you knew was in a canonical precomposed form, such as é
U+00E9
, you might choose form NFC
to also convert the characters into their canonical precomposed form for easy matching of the single code point.
Java provides the class java.text.Normalizer
which implements the normalization algorithm in UAX #15, representing the various forms using the java.text.Normalizer.Form
enum. The following example shows how to search for the canonical precomposed form of é
U+00E9
in a string that initially used the decomposed form e
U+0065
followed by the combining acute accent ́ U+0301
, by first normalizing the string using Normalization Form C
(NFC
).
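Such a search might be sketched as follows, assuming the decomposed input shown in the comments (the class name is illustrative):

```java
import java.text.Normalizer;

public class NormalizedSearch {
    public static void main(String[] args) {
        String decomposed = "touche\u0301"; // touché stored as e + combining acute accent
        // Searching the raw string for the precomposed é U+00E9 finds nothing
        assert decomposed.indexOf('\u00E9') < 0;
        // After normalizing to NFC, the precomposed form is present and found
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        assert nfc.indexOf('\u00E9') >= 0;
    }
}
```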
Collation
The Java Comparator.compare(T o1, T o2)
method, the sorting strategy you've used in several lessons, allows any two objects to be compared. The trick is deciding how a certain type of object should be ordered. Sorting numbers could hardly be simpler: simply compare the values arithmetically. It would be a mistake to think that sorting Unicode code points is that simple; human languages have evolved haphazardly for thousands of years, and the result is that sorting text, or collation, is replete with complications, confusions, and contradictions.
Many a developer has naively tried to sort strings by comparing the Unicode code point value of individual characters, as in the following example:
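A sketch of such a naive comparison (essentially what String.compareTo does; names are illustrative) might look like this:

```java
public class NaiveStringCompare {
    // Broken in general: compares raw char values lexicographically
    static int compareNaively(String s1, String s2) {
        int shorter = Math.min(s1.length(), s2.length());
        for (int i = 0; i < shorter; i++) {
            int diff = s1.charAt(i) - s2.charAt(i);
            if (diff != 0) {
                return diff;
            }
        }
        return s1.length() - s2.length();
    }

    public static void main(String[] args) {
        assert compareNaively("car", "cat") < 0;        // fine for simple lowercase English
        assert compareNaively("Zanzibar", "apple") < 0; // but Z (U+005A) sorts before a (U+0061)!
    }
}
```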
For comparing simple words in English using all lowercase letters such as cat
and car
, this algorithm works! But in real life this approach quickly runs into trouble. You may want to refer to the charts in the lesson on charsets.
- Case: From the ASCII table, the uppercase letters A U+0041 – Z U+005A appear separately and before the lowercase letters a U+0061 – z U+007A. This means that Zanzibar, with a capital letter, would be sorted before apple!
- Diacritics: From the ISO-8859-1 table, the accented letter é U+00E9 appears after all the unaccented letters, both uppercase and lowercase, so touching would be sorted before touché!
- Composition: The accented letter é U+00E9 may be stored in decomposed form as e U+0065 followed by the combining acute accent U+0301. The sort order would change based on whether the precomposed form was used. The combining mark(s) would also skew comparisons, so that touched would appear before touché stored in decomposed form.
You might have guessed that getting around the composition problem might involve some form of normalization. Thinking further, you might realize that accents could be ignored if the sequence were first decomposed, perhaps using Normalization Form D
(NFD
), and then removing the decomposed accent characters before sorting. To ignore differences in case, you could have some way to map the uppercase characters to lowercase characters before sorting.
In fact Unicode provides just such a mapping! Looking at the UCD table at the beginning of this lesson, you'll see for example that the character E
U+0045
has a simple lowercase mapping
of U+0065
, the code point for e
. In addition to mapping case, however, you have to decide if the original case should be considered, as in some contexts the user would still expect an uppercase version of a letter to go before (or after) the lowercase version.
Rather than doing all this work manually, however, you should use the collation tools Java puts at your disposal in the form of a collator.
Level | Description | Example |
---|---|---|
L1 | Base characters | role < roles < rule |
L2 | Accents | role < rôle < roles |
L3 | Case / Variants | role < Role < rôle |
L4 | Punctuation | role < “role” < Role |
Ln | Identical | role < ro□le < “role” |
Collators
Unicode provides UTS #10: Unicode Collation Algorithm to specify the steps to take when sorting sequences of characters. Because ordering expectations may differ based upon whether words are appearing in a dictionary or in a user list, for example, UTS #10 prescribes a multilevel comparison algorithm. The table lists these Comparison Levels
to choose from when sorting text according to Unicode rules.
The Unicode collation algorithm is difficult; UTS #10 is dense and complicated. To simplify things somewhat, Java provides the java.text.Collator
class. A Collator
is a comparator that knows how to normalize strings and then compare them taking into account accents and case as necessary and appropriate for different locales. You can get a collator for the current locale using Collator.getInstance()
, or for a particular locale using Collator.getInstance(Locale desiredLocale)
. The Collator.compare(String source, String target)
method is used as you would for a Comparator<String>
.
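A minimal sketch of collator-based comparison, contrasted with the raw String ordering (class name illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorCompare {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        // The collator orders "apple" before "Zanzibar", as a user would expect
        assert collator.compare("apple", "Zanzibar") < 0;
        // The raw String ordering does the opposite, because 'Z' < 'a' by code point
        assert "Zanzibar".compareTo("apple") < 0;
    }
}
```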
Collator Strength
Similar to the comparison levels of UTS #10, Java provides four collator strengths, which can be set using Collator.setStrength(int newStrength)
. The first three correspond roughly to the Unicode L1
, L2
, and L3
comparison levels.
Collator.PRIMARY
- Considers only the base character, and ignores accents and case.
Collator.SECONDARY
- Considers the base character with any accents; case is ignored.
Collator.TERTIARY
- Considers the base character, accents, and other variations such as case. This is the default strength.
Collator.IDENTICAL
- The characters will only be considered equal if they are the exact same character.
PRIMARY
and SECONDARY
strengths are both case-insensitive, but only PRIMARY
is insensitive to accents. The most permissive is therefore PRIMARY
, while the most strict is IDENTICAL
. As an example, consider the word cafe
, which sometimes appears as café
to reflect the French spelling. Using PRIMARY
collator strength, the words cafe
, CAFE
, and café
would all be considered the same word. Using SECONDARY
collator strength, cafe
and café
would be considered different words, but cafe
and CAFE
would be considered the same. Finally TERTIARY
collator strength would consider all three forms as distinct, receiving different orderings. (The IDENTICAL
strength would consider the three forms distinct as well, and would take into account any other differences that might appear beyond base characters, accents, and case.)
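These strength behaviors can be sketched with a US English collator; the exact equalities shown assume the default en/US collation rules:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengths {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);

        collator.setStrength(Collator.PRIMARY);
        assert collator.equals("cafe", "CAFE");       // case ignored
        assert collator.equals("cafe", "caf\u00E9");  // accents ignored too

        collator.setStrength(Collator.SECONDARY);
        assert collator.equals("cafe", "CAFE");       // case still ignored
        assert !collator.equals("cafe", "caf\u00E9"); // but accents now matter
    }
}
```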
Collator Decomposition
As you learned in the above sections, comparison of character sequences will not be valid unless the code points have first been normalized. For collation this is best done by first decomposing characters into their base characters, accents, and other marks, using Normalization Form D
(NFD
) or Normalization Form KD
(NFKD
), above. By default a Collator performs no decomposition! If you want normalization before comparison, you must set the decomposition level using Collator.setDecomposition(int decompositionMode)
. The Collator.CANONICAL_DECOMPOSITION
level corresponds to Normalization Form D
(NFD
) and should be your first choice for the collator decomposition setting.
Collator Comparison
To illustrate the use of a collator, suppose you want to sort a list of strings without regard to case or accents. You want to normalize the strings in case some of the strings use different composition forms. You would create and configure a collator as in the following example.
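One possible configuration, sketched here with an illustrative helper method, sorts while ignoring case and accents and normalizing composition forms:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CollatorSort {
    // Illustrative helper: returns a copy of the given list sorted without
    // regard to case or accents, under US English collation rules
    static List<String> sortIgnoringCaseAndAccents(List<String> words) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY);                      // ignore case and accents
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); // normalize composition forms
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator implements Comparator
        return sorted;
    }

    public static void main(String[] args) {
        // touché appears once precomposed (U+00E9) and once decomposed (e + U+0301)
        List<String> sorted = sortIgnoringCaseAndAccents(
                List.of("zebra", "touche\u0301", "Apple", "touch\u00E9"));
        assert sorted.get(0).equals("Apple"); // Apple first despite its capital letter
    }
}
```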
Review
Summary
- Unicode code points for supplementary characters are encoded in Java character sequences using surrogate pairs of character values.
- Some Unicode characters can be stored in precomposed form, or in a decomposed form consisting of a base character followed by combining marks.
- Internationalized text must use a
Collator
for comparison and sorting, rather than the traditional String
methods, to account for the various Unicode forms and rules.
Gotchas
- Assuming that each
char
in a CharSequence
represents a Unicode character is incorrect, and will yield erroneous results for characters beyond the Basic Multilingual Plane. - Don't use the comparison and equality methods of
java.lang.String
, because they don't take into account internationalization issues and will not work correctly across locales. - If you don't set the decomposition level before using a
Collator
, strings with different mixes of precomposed characters will not be sorted correctly.
In the Real World
- Use
CharSequence
rather than String
or StringBuilder
for the parameters of methods that process sequences of characters. - You must take into account that each
char
in a CharSequence
is potentially part of a surrogate pair or your results will not be correct for supplementary characters. - Use the stream returned by
CharSequence.codePoints()
to skip the need to check for surrogates altogether.
Think About It
- Do the strings you are sorting or otherwise comparing represent text from some language or locale? If so you should be creating and configuring a collator for the locale, rather than comparing using
String
methods.
Self Evaluation
- What is the difference between internationalization and localization?
- How is a Java
Locale
related to a language tag
? - What is the difference between
CharSequence.chars()
and CharSequence.codePoints()
, and why would you want to use one rather than the other? - Which normalization form does the W3C recommend for text published on web sites? How does this form differ from other normalization forms?
- Why is it necessary to use a
Collator
rather than the String
comparison and equality methods for internationalization text?
Task
Your Booker application compares locale-sensitive information in various places, such as when sorting publications by title and looking them up by title. You now know that the approach used so far using String
comparison methods is incorrect and can yield erroneous results, such as sorting in an incorrect order or failing to find a publication by its title.
Convert your publication title sorting and lookup logic to correctly take internationalization issues into account. Sorting should be performed without regard to case or diacritics. Similarly, lookup based on title should work without regard to whether the user's search string contains capital letters or accents. Create unit tests showing that the new sorting and comparison works, using character sequences for testing that would fail had the traditional String
comparison methods been used.
See Also
- Internationalization and localization (Wikipedia)
- Trail: Internationalization (Oracle - The Java™ Tutorials)
- Lesson: Working with Text (Oracle - The Java™ Tutorials)
- Normalizing Text (Oracle - The Java™ Tutorials)
- Comparing Strings (Oracle - The Java™ Tutorials)
- Unicode surrogate programming with the Java language (Masahiko Maedera, IBM developerWorks) Not updated to latest Java API.
- Java Internationalization (Andy Deitsch, David Czarnecki - O'Reilly, 2001)
References
- The Unicode Standard (Unicode Consortium)
- UAX #9: Unicode Bidirectional Algorithm
- UAX #15: Unicode Normalization Forms
- UAX #44: Unicode Character Database
- UTS #10: Unicode Collation Algorithm
- Unicode FAQ: Normalization (Unicode Consortium)
Resources
- Internet Engineering Task Force (IETF)
- W3C Internationalization (i18n) Activity
- IANA Language Subtag Registry
- Unicode Character Database
- International Components for Unicode (ICU)
Acknowledgments
- Image of Unicode code point
U+10014
Linear B syllable MA
by Ch1902 [Public domain], via Wikimedia Commons. - Image of Unicode emoji character
U+1F91E
HAND WITH INDEX AND MIDDLE FINGERS CROSSED
from EmojiOne licensed under CC-BY 4.0. - The use of the words
cafe
and café
as collator examples was inspired by Java™ Internationalization by Andrew Deitsch and David Czarnecki (O'Reilly, 2001). - Some symbols are from Font Awesome by Dave Gandy.