Unicode
Goals
- Be familiar with Unicode character categories.
- Understand character composition.
- Be aware of Java character surrogate pairs.
- Know how to normalize character sequences.
- Realize the importance of using a collator for internationalization string comparison.
Concepts
- base character
- Basic Multilingual Plane (BMP)
- Best Current Practice (BCP)
- bidi
- canonical
- character
- character composition
- collation
- collator
- collator strength
- diacritic
- glyph
- high surrogate
- International Components for Unicode (ICU)
- Internet Engineering Task Force (IETF)
- low surrogate
- mark
- normalization
- precomposed character
- render
- supplementary character
- surrogate pairs
- Universal Coded Character Set (UCS)
- World Wide Web Consortium (W3C)
Library
java.lang.Character
java.lang.Character.getName(int codePoint)
java.lang.Character.getType(int codePoint)
java.lang.Character.isAlphabetic(int codePoint)
java.lang.Character.isDigit(int codePoint)
java.lang.Character.isHighSurrogate(char ch)
java.lang.Character.isLetter(int codePoint)
java.lang.Character.isLetterOrDigit(int codePoint)
java.lang.Character.isLowSurrogate(char ch)
java.lang.Character.isLowerCase(int codePoint)
java.lang.Character.isMirrored(int codePoint)
java.lang.Character.isSpaceChar(int codePoint)
java.lang.Character.isTitleCase(int codePoint)
java.lang.Character.isUpperCase(int codePoint)
java.lang.Character.isSurrogate(char ch)
java.lang.Character.isWhitespace(int codePoint)
java.lang.Character.isValidCodePoint(int codePoint)
java.lang.Character.toCodePoint(char high, char low)
java.lang.CharSequence
java.lang.CharSequence.charAt(int index)
java.lang.CharSequence.chars()
java.lang.CharSequence.codePoints()
java.lang.CharSequence.length()
java.lang.CharSequence.subSequence(int start, int end)
java.lang.String
java.lang.String.compareTo(String anotherString)
java.lang.String.compareToIgnoreCase(String str)
java.lang.String.equals(Object anObject)
java.lang.String.equalsIgnoreCase(String anotherString)
java.lang.StringBuilder
java.lang.StringBuilder.appendCodePoint(int codePoint)
java.text.Normalizer
java.text.Normalizer.Form
java.text.Collator
java.text.Collator.CANONICAL_DECOMPOSITION
java.text.Collator.IDENTICAL
java.text.Collator.PRIMARY
java.text.Collator.SECONDARY
java.text.Collator.TERTIARY
java.text.Collator.compare(Object o1, Object o2)
java.text.Collator.compare(String source, String target)
java.text.Collator.equals(String source, String target)
java.text.Collator.getInstance()
java.text.Collator.getInstance(Locale desiredLocale)
java.text.Collator.setDecomposition(int decompositionMode)
java.text.Collator.setStrength(int newStrength)
java.util.stream.IntStream
java.util.Locale
java.util.Locale.CANADA_FRENCH
java.util.Locale.CHINA
java.util.Locale.ENGLISH
java.util.Locale.UK
java.util.Locale.US
java.util.Locale.forLanguageTag(String languageTag)
java.util.Locale.getAvailableLocales()
java.util.Locale.getDefault()
java.util.Locale.getDefault(Locale.Category category)
java.util.Locale.setDefault(Locale newLocale)
java.util.Locale.setDefault(Locale.Category category, Locale newLocale)
java.util.Locale.toLanguageTag()
java.util.Locale.toString()
Lesson
You learned in a previous lesson that a major part of internationalization is separating resources from code, placing display strings and related information into resource files to be localized for different locales. Internationalization, however, does not end with being able to substitute different strings for different languages. Processing those strings requires special awareness of variations in writing systems and writing direction. You can still run into problems, even using Unicode string resources, if you assume that the strings follow the rules of American English. Many assumptions of a native English-speaking developer do not hold up when applied to other locales, as shown by the following figure.
The Unicode Consortium has intensely studied the various writing systems and produced rules for working with Unicode code points to allow for the variations in those systems. Studying those rules first requires attention to Java's own representation of Unicode characters.
Characters
You were introduced to the concepts of characters in an earlier lesson on charsets. Each Unicode code point in the Universal Coded Character Set (UCS) represents a specific character, which is a distinct semantic entity without regard to the style in which it is presented. You also learned that Unicode characters with code points in the range 0x0000
– 0xFFFF
make up the Basic Multilingual Plane (BMP). Code points beyond this range are for supplementary characters.
You cannot see
a character itself; it is only a concept. A stylistic representation of a character is called a glyph. The presentations E and E are two different glyphs representing the character named LATIN CAPITAL LETTER E
. To display a glyph representation, a computer will render the character in some font and style.
In Java a primitive char can be wrapped in a java.lang.Character
instance, but the Character
class also represents the Unicode concept of a character
, and provides access to much useful information about the Unicode character definitions. In this lesson Character
is used to indicate the Java class, while character
is used to mean a Unicode character in general.
Character Properties
The Unicode Standard provides definitions of all UCS characters in the Unicode Character Database (UCD), including extensive property lists and supplementary information, which it distributes as a series of text files accessible to computer processing. See Unicode Character Database for access to the UCD itself, and UAX #44: Unicode Character Database for a specification of UCD file formats.
Example entries from the UnicodeData.txt file distributed with the UCD.

Code Point | Glyph | Name | General Category | Canonical Combining Class | Bidi Class | Decomposition Type / Mapping | Numeric Type / Value | Bidi Mirrored | Simple Uppercase Mapping | Simple Lowercase Mapping | Simple Titlecase Mapping
---|---|---|---|---|---|---|---|---|---|---|---
U+0020 | | SPACE | Zs | 0 | WS | | | N | | |
U+0021 | ! | EXCLAMATION MARK | Po | 0 | ON | | | N | | |
U+0024 | $ | DOLLAR SIGN | Sc | 0 | ET | | | N | | |
U+0033 | 3 | DIGIT THREE | Nd | 0 | EN | | 3 | N | | |
U+0045 | E | LATIN CAPITAL LETTER E | Lu | 0 | L | | | N | | 0065 |
U+0065 | e | LATIN SMALL LETTER E | Ll | 0 | L | | | N | 0045 | | 0045
U+007B | { | LEFT CURLY BRACKET | Ps | 0 | ON | | | Y | | |
U+007D | } | RIGHT CURLY BRACKET | Pe | 0 | ON | | | Y | | |
U+00C9 | É | LATIN CAPITAL LETTER E WITH ACUTE | Lu | 0 | L | 0045 0301 | | N | | 00E9 |
U+00E9 | é | LATIN SMALL LETTER E WITH ACUTE | Ll | 0 | L | 0065 0301 | | N | 00C9 | | 00C9
U+0301 | ́ | COMBINING ACUTE ACCENT | Mn | 230 | NSM | | | N | | |
U+0628 | ب | ARABIC LETTER BEH | Lo | 0 | AL | | | N | | |
U+0663 | ٣ | ARABIC-INDIC DIGIT THREE | Nd | 0 | AN | | 3 | N | | |
U+092E | म | DEVANAGARI LETTER MA | Lo | 0 | L | | | N | | |
U+10014 | | LINEAR B SYLLABLE B080 MA | Lo | 0 | L | | | N | | |
U+1F602 | 😂 | FACE WITH TEARS OF JOY | So | 0 | ON | | | N | | |
The official UCS name of a character can be retrieved using Character.getName(int codePoint)
.
Character Categories
Code | Name | Description |
---|---|---|
Lu | Uppercase_Letter | an uppercase letter |
Ll | Lowercase_Letter | a lowercase letter |
Lt | Titlecase_Letter | a digraphic character, with first part uppercase |
LC | Cased_Letter | Lu | Ll | Lt |
Lm | Modifier_Letter | a modifier letter |
Lo | Other_Letter | other letters, including syllables and ideographs |
L | Letter | Lu | Ll | Lt | Lm | Lo |
Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) |
Mc | Spacing_Mark | a spacing combining mark (positive advance width) |
Me | Enclosing_Mark | an enclosing combining mark |
M | Mark | Mn | Mc | Me |
Nd | Decimal_Number | a decimal digit |
Nl | Letter_Number | a letterlike numeric character |
No | Other_Number | a numeric character of other type |
N | Number | Nd | Nl | No |
Pc | Connector_Punctuation | a connecting punctuation mark, like a tie |
Pd | Dash_Punctuation | a dash or hyphen punctuation mark |
Ps | Open_Punctuation | an opening punctuation mark (of a pair) |
Pe | Close_Punctuation | a closing punctuation mark (of a pair) |
Pi | Initial_Punctuation | an initial quotation mark |
Pf | Final_Punctuation | a final quotation mark |
Po | Other_Punctuation | a punctuation mark of other type |
P | Punctuation | Pc | Pd | Ps | Pe | Pi | Pf | Po |
Sm | Math_Symbol | a symbol of mathematical use |
Sc | Currency_Symbol | a currency sign |
Sk | Modifier_Symbol | a non-letterlike modifier symbol |
So | Other_Symbol | a symbol of other type |
S | Symbol | Sm | Sc | Sk | So |
Zs | Space_Separator | a space character (of various non-zero widths) |
Zl | Line_Separator | U+2028 LINE SEPARATOR only |
Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only |
Z | Separator | Zs | Zl | Zp |
Cc | Control | a C0 or C1 control code |
Cf | Format | a format control character |
Cs | Surrogate | a surrogate code point |
Co | Private_Use | a private-use character |
Cn | Unassigned | a reserved unassigned code point or a noncharacter |
C | Other | Cc | Cf | Cs | Co | Cn |
Each character is assigned a general category
, which is important for determining how characters relate to one another. A character may be, for example, a lowercase or uppercase letter, a number, or punctuation. The general category column in the table above contains codes specifying the category shown in the figure to the side. Unicode also defines category groups that are shorthand for several categories together, but these are not used in the definitions of the individual characters.
- The space character U+0020 is a Space_Separator (Zs).
- The exclamation mark ! character U+0021 is Other_Punctuation (Po); while the left curly bracket { character U+007B and right curly bracket } character U+007D are considered Open_Punctuation (Ps) and Close_Punctuation (Pe), respectively.
- The dollar sign $ character U+0024 is a Currency_Symbol (Sc).
- The digit 3 U+0033 and the Arabic-Indic digit ٣ U+0663 are considered Decimal_Number (Nd).
- The characters E U+0045 and É U+00C9 are Uppercase_Letter (Lu), while the characters e U+0065 and é U+00E9 are Lowercase_Letter (Ll).
- The letters ب U+0628, म U+092E, and U+10014 are considered Other_Letter (Lo), as Arabic, Devanagari, and Linear B respectively have no concept of uppercase or lowercase.
- The combining acute accent character U+0301 is considered a Nonspacing_Mark (Mn), the significance of which will be explained below under Composition.
Java refers to a character's general category as its type
, which can be retrieved using Character.getType(int codePoint)
. The return value is an int matching one of the Character class constants that represent the general categories in the table above. Rather than querying the type generally, the Character
class has several convenience methods for querying specific categories. The convenience methods covering the category groups are especially handy, as they encompass several types.
- Character.isAlphabetic(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), Letter_Number (Nl), or contributory property Other_Alphabetic. Equivalent to group Letter (L) with the addition of Letter_Number (Nl).
- Character.isDigit(int codePoint): Decimal_Number (Nd).
- Character.isLetter(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), or Other_Letter (Lo). Equivalent to group Letter (L).
- Character.isLetterOrDigit(int codePoint): Uppercase_Letter (Lu), Lowercase_Letter (Ll), Titlecase_Letter (Lt), Modifier_Letter (Lm), Other_Letter (Lo), or Decimal_Number (Nd). Equivalent to group Letter (L) with the addition of Decimal_Number (Nd).
- Character.isLowerCase(int codePoint): Lowercase_Letter (Ll) or contributory property Other_Lowercase.
- Character.isMirrored(int codePoint): mirrored characters, such as those in categories Open_Punctuation (Ps) and Close_Punctuation (Pe).
- Character.isSpaceChar(int codePoint): Space_Separator (Zs), Line_Separator (Zl), or Paragraph_Separator (Zp). Equivalent to group Separator (Z).
- Character.isTitleCase(int codePoint): Titlecase_Letter (Lt).
- Character.isUpperCase(int codePoint): Uppercase_Letter (Lu) or contributory property Other_Uppercase.
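A few of these lookups can be sketched directly against the standard library; the characters tested here are the ones from the UCD table above (the class name is illustrative):

```java
public class CategoryDemo {
    public static void main(String[] args) {
        // É U+00C9 is an Uppercase_Letter (Lu)
        assert Character.getType(0x00C9) == Character.UPPERCASE_LETTER;
        // The Arabic-Indic digit three U+0663 is a Decimal_Number (Nd)
        assert Character.isDigit(0x0663);
        // The combining acute accent U+0301 is a Nonspacing_Mark (Mn)
        assert Character.getType(0x0301) == Character.NON_SPACING_MARK;
        // Supplementary characters are queried by int code point, e.g. Linear B U+10014
        assert Character.isLetter(0x10014);
        System.out.println(Character.getName(0x00C9)); // LATIN CAPITAL LETTER E WITH ACUTE
    }
}
```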
Bidi Properties
Some scripts such as Arabic and Hebrew are written from right-to-left rather than left-to-right. Several properties, such as Bidi Class
and Bidi Mirrored
in the table above, provide extensive information on the bidirectional or bidi characteristics of each character. This information is used when rendering a string of characters to actually display them to the user, which uses a complex set of rules described in UAX #9: Unicode Bidirectional Algorithm. Storing and retrieving internationalized strings can be done in large part without regard to bidi-related issues, which may be left to the rendering code that actually prints or displays the text.
Composition
The ISO-8859-1 charset, as you saw in an earlier lesson, increased the ASCII repertoire mostly by adding many accented letters found in European languages, such as the é
U+00E9
that appears in touché
. Many languages that use these accented characters do not consider them to be different letters as such. Rather the accent or diacritic (from diacritical mark
) is added to a letter to ensure that its sound remains the same in the presence of other letters. (This is similar to how English adds a letter e
to the end of words such as cane
so that the vowel sound of can
is changed). A diacritic is one of several types of marks, as indicated by the category Mark
(M
) in the character category table above.
The UCS provides separate diacritics and other marks in the category Nonspacing_Mark
, which may be placed after a base character with the understanding that the base character and the mark(s) will be shown to the user as one character. Rather than storing a single code point U+00E9
representing é
, you could store the code point U+0065
representing the base character e
, followed by the code point U+0301
for the combining acute accent mark. Through character composition this will create what appears to be the same character: é
. The letter é
U+00E9
is called a precomposed character because its form is already composed of the base character U+0065
and the combining mark U+0301
.
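The two storage forms can be seen in Java string literals; as a minimal sketch, they are distinct char sequences even though both display as é:

```java
public class ComposedForms {
    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as the single precomposed code point U+00E9
        String decomposed = "e\u0301"; // e U+0065 followed by COMBINING ACUTE ACCENT U+0301
        // Both render as é, but as char sequences they are different lengths and not equal
        assert precomposed.length() == 1 && decomposed.length() == 2;
        assert !precomposed.equals(decomposed);
    }
}
```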
Character Sequences
Many times in Java you will need to deal with more than just single characters. As you saw above, at times what may appear as a single glyph is in reality stored as a combination of a base character followed by combining mark characters. The character sequence you are most familiar with is java.lang.String
.
CharSequence
A more general interface for dealing with sequences of characters is java.lang.CharSequence
. It provides many of the same char
access methods you may already have noticed in String
, such as CharSequence.length()
and CharSequence.charAt(int index)
. Both String
and java.lang.StringBuilder
implement the CharSequence
interface.
CharSequence
can also provide its char
values as an IntStream
via the CharSequence.chars()
method. Counting occurrences of 'E'
/ 'e'
in a string could be accomplished much more compactly (and potentially more efficiently) using stream filtering and counting.
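Such a stream-based count might be sketched as follows (the countEs helper name is illustrative, not part of any API):

```java
public class LetterECount {
    // Illustrative helper: counts occurrences of 'E' or 'e' using the chars() stream
    static long countEs(CharSequence text) {
        return text.chars()                        // IntStream of char values
                .filter(c -> c == 'E' || c == 'e') // keep only uppercase or lowercase e
                .count();
    }

    public static void main(String[] args) {
        assert countEs("Eleven geese") == 6;
    }
}
```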
Another useful method is CharSequence.subSequence(int start, int end)
, which functions similarly to collection views such as java.util.List.subList(int fromIndex, int toIndex)
. The end
index is exclusive. Thus calling "touché".subSequence(1, 5)
would yield the string "ouch"
.
Surrogate Pairs
Some UCS code points such as the Linear B symbol U+10014
in the table above are outside the BMP and require more than the 16 bits available in a single char
. When working with the Character
class, we got around this limitation by using a 32-bit int
as a parameter in methods such as Character.isValidCodePoint(int codePoint)
. CharSequence
implementations such as String
, however, are conceptually made up of char
values. To understand how Java provides access to supplementary characters, you must first understand some of the details of UTF-16 encoding.
When discussing charsets you learned that both UTF-8 and UTF-16 are variable-length encodings, and both cover the entire range of UCS code points. UTF-8 can use one, two, three, or four bytes to encode a single code point using an involved algorithm. UTF-16, on the other hand, uses a simpler approach. For any code points between 0x0000
and 0xFFFF
, UTF-16 will use a single 16-bit value to represent the character. For supplementary characters (U+10000
and above) UTF-16 will use two subsequent 16-bit values called a surrogate pair. The first value or high surrogate will be in the range 0xD800
– 0xDBFF
, while the second value or low surrogate will be in the range 0xDC00
– 0xDFFF
. This encoding technique only works because there exist no UCS characters that use code points in the range 0xD800
– 0xDFFF
.
Java uses UTF-16 in CharSequence
and its implementations. This has a startling implication: not every char
value represents a Unicode code point. Every char
in a CharSequence
such as String
is potentially part of a surrogate pair, which must be decoded to discover the UCS code point! To discover if a char
value is part of a surrogate pair, use Character.isSurrogate(char ch)
. Because surrogate pairs come in a certain order, you can use Character.isHighSurrogate(char ch)
to determine if a surrogate pair is starting, followed by Character.isLowSurrogate(char ch)
for the subsequent character of any encountered pair. After detecting a surrogate pair, you can determine the encoded Unicode code point using Character.toCodePoint(char high, char low)
.
As an example of how not to process a character sequence, the following code naively looks at each char
without regard to whether a surrogate pair is present when counting characters in the Letter
(L
) category.
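A naive loop of this kind might look like the following sketch (class and method names are illustrative):

```java
public class NaiveLetterCount {
    // Broken: treats every char as if it were a complete code point
    static int countLettersNaively(CharSequence text) {
        int letterCount = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isLetter(text.charAt(i))) { // surrogate halves never count as letters
                letterCount++;
            }
        }
        return letterCount;
    }

    public static void main(String[] args) {
        // U+10014 LINEAR B SYLLABLE B080 MA is stored as the surrogate pair 0xD800 0xDC14
        String text = "ma: \uD800\uDC14";
        System.out.println(countLettersNaively(text)); // prints 2; the Linear B letter is missed
    }
}
```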
Unfortunately the above implementation would skip the Linear B symbol U+10014
altogether. Even though Java recognizes it as being in the Unicode category Other_Letter
(Lo
), it is encoded in the string as the UTF-16 surrogate pair 0xD800
and 0xDC14
. Neither of these char
values on its own represents a letter. To correct the algorithm, one would need to use the Character
surrogate pair detection and conversion methods explained above, as illustrated in the figure below.
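A corrected loop along those lines might be sketched as follows, joining each surrogate pair into a full code point before classifying it (names are illustrative):

```java
public class SurrogateAwareCount {
    // Joins each surrogate pair into a code point before testing its category
    static int countLetters(CharSequence text) {
        int letterCount = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int codePoint = c;
            if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                codePoint = Character.toCodePoint(c, text.charAt(i + 1));
                i++; // skip the low surrogate just consumed
            }
            if (Character.isLetter(codePoint)) {
                letterCount++;
            }
        }
        return letterCount;
    }

    public static void main(String[] args) {
        // ends with U+10014 LINEAR B SYLLABLE B080 MA, encoded as 0xD800 0xDC14
        assert countLetters("ma: \uD800\uDC14") == 3; // m, a, and the Linear B letter
    }
}
```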
Besides CharSequence.chars()
there is another method that returns a stream: CharSequence.codePoints()
. Although both methods return an IntStream
, CharSequence.codePoints()
returns a sequence of Unicode code point values instead of char
values. Unlike CharSequence.chars()
, CharSequence.codePoints()
will never return a surrogate pair, and you will never have to check for them! Just be careful not to cast the values returned by CharSequence.codePoints()
to char
.
Using CharSequence.codePoints()
the letter-counting algorithm above could be made much more compact and readable.
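For instance, a code point based version of the letter count might be sketched as (names illustrative):

```java
public class CodePointLetterCount {
    // codePoints() has already joined any surrogate pairs into full code points
    static long countLetters(CharSequence text) {
        return text.codePoints()
                .filter(Character::isLetter)
                .count();
    }

    public static void main(String[] args) {
        assert countLetters("ma: \uD800\uDC14") == 3; // now includes U+10014
    }
}
```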
Normalization
You saw above that some characters such as é
U+00E9
have both a precomposed form and a decomposed form, with accents and other marks stored as distinct code points. This can cause major problems with even simple operations such as searching for a character or comparing strings. You surely would not want to perform separate searches for U+00E9
and for the sequence U+0065
U+0301
, just to cover all composition forms of é
. Imagine searching for letters that have decomposed forms consisting of three or more characters! Checking for all possible representations would become a nightmare when comparing strings.
Code | Name | Description |
---|---|---|
NFD | Normalization Form D | Canonical Decomposition |
NFC | Normalization Form C | Canonical Decomposition , followed by Canonical Composition |
NFKD | Normalization Form KD | Compatibility Decomposition |
NFKC | Normalization Form KC | Compatibility Decomposition , followed by Canonical Composition |
Unicode's solution to the problem of precomposed characters is normalization, the process of converting characters to some normal form. Here normal
simply means some common, agreed upon representation; and UAX #15: Unicode Normalization Forms defines four such normalization forms, displayed in the accompanying figure. Normalization occurs by some sequence of decomposing the precomposed characters and optionally composing them into precomposed forms (whether they started that way or not). The most important of these forms use canonical decomposition/composition, using the forms preferred by The Unicode Standard.
Two strings normalized to the same normalization form can be safely compared, because the composition of each character is guaranteed to be the same by the algorithm. Which form you choose depends on your needs. If you wanted simply to compare two strings, you could use form NFD
and decompose all the characters to their canonical decomposed forms, with no need to put them back into precomposed forms. If you were searching for a character you knew was in a canonical precomposed form, such as é
U+00E9
, you might choose form NFC
to also convert the characters into their canonical precomposed form for easy matching of the single code point.
Java provides the class java.text.Normalizer
which implements the normalization algorithm in UAX #15, representing the various forms using the java.text.Normalizer.Form
enum. The following example shows how to search for the canonical precomposed form of é
U+00E9
in a string that initially used the decomposed form e
U+0065
followed by the combining acute accent ́ U+0301
, by first normalizing the string using Normalization Form C
(NFC
).
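Such a search might be sketched as follows, assuming the decomposed input shown in the comments (the class name is illustrative):

```java
import java.text.Normalizer;

public class NormalizedSearch {
    public static void main(String[] args) {
        String decomposed = "touche\u0301"; // touché stored as e + combining acute accent
        // Searching the raw string for the precomposed é U+00E9 finds nothing
        assert decomposed.indexOf('\u00E9') < 0;
        // After normalizing to NFC, the precomposed form is present and found
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        assert nfc.indexOf('\u00E9') >= 0;
    }
}
```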
Collation
The Java Comparator.compare(T o1, T o2)
method, the sorting strategy you've used in several lessons, allows any two objects to be compared. The trick is deciding how a certain type of object should be ordered. Sorting numbers could hardly be simpler: simply compare the values arithmetically. It would be a mistake to think that sorting Unicode code points is that simple; human languages have evolved haphazardly for thousands of years, and the result is that sorting text, or collation, is replete with complications, confusions, and contradictions.
Many a developer has naively tried to sort strings by comparing the Unicode code point value of individual characters, as in the following example:
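A sketch of such a naive comparison (essentially what String.compareTo does; names are illustrative) might look like this:

```java
public class NaiveStringCompare {
    // Broken in general: compares raw char values lexicographically
    static int compareNaively(String s1, String s2) {
        int shorter = Math.min(s1.length(), s2.length());
        for (int i = 0; i < shorter; i++) {
            int diff = s1.charAt(i) - s2.charAt(i);
            if (diff != 0) {
                return diff;
            }
        }
        return s1.length() - s2.length();
    }

    public static void main(String[] args) {
        assert compareNaively("car", "cat") < 0;        // fine for simple lowercase English
        assert compareNaively("Zanzibar", "apple") < 0; // but Z (U+005A) sorts before a (U+0061)!
    }
}
```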
For comparing simple words in English using all lowercase letters such as cat
and car
, this algorithm works! But in real life this approach quickly runs into trouble. You may want to refer to the charts in the lesson on charsets.
- Case: From the ASCII table, the uppercase letters A U+0041 – Z U+005A appear separately and before the lowercase letters a U+0061 – z U+007A. This means that Zanzibar, with a capital letter, would be sorted before apple!
- Diacritics: From the ISO-8859-1 table, the accented letter é U+00E9 appears after all the unaccented letters, both uppercase and lowercase, so touching would be sorted before touché!
- Composition: The accented letter é U+00E9 may be stored in decomposed form as e U+0065 followed by the combining acute accent U+0301. The sort order would change based on whether the precomposed form was used. The combining mark(s) would also skew comparisons, so that touched would appear before touché stored in decomposed form.
You might have guessed that getting around the composition problem might involve some form of normalization. Thinking further, you might realize that accents could be ignored if the sequence were first decomposed, perhaps using Normalization Form D
(NFD
), and then removing the decomposed accent characters before sorting. To ignore differences in case, you could have some way to map the uppercase characters to lowercase characters before sorting.
In fact Unicode provides just such a mapping! Looking at the UCD table at the beginning of this lesson, you'll see for example that the character E
U+0045
has a simple lowercase mapping
of U+0065
, the code point for e
. In addition to mapping case, however, you have to decide if the original case should be considered, as in some contexts the user would still expect an uppercase version of a letter to go before (or after) the lowercase version.
Rather than doing all this work manually, however, you should use the collation tools Java puts at your disposal in the form of a collator.
Level | Description | Example |
---|---|---|
L1 | Base characters | role < roles < rule |
L2 | Accents | role < rôle < roles |
L3 | Case / Variants | role < Role < rôle |
L4 | Punctuation | role < “role” < Role |
Ln | Identical | role < ro□le < “role” |
Collators
Unicode provides UTS #10: Unicode Collation Algorithm to specify the steps to take when sorting sequences of characters. Because ordering expectations may differ based upon whether words are appearing in a dictionary or in a user list, for example, UTS #10 prescribes a multilevel comparison algorithm. The table lists these Comparison Levels
to choose from when sorting text according to Unicode rules.
The Unicode collation algorithm is difficult; UTS #10 is dense and complicated. To simplify things somewhat, Java provides the java.text.Collator
class. A Collator
is a comparator that knows how to normalize strings and then compare them taking into account accents and case as necessary and appropriate for different locales. You can get a collator for the current locale using Collator.getInstance()
, or for a particular locale using Collator.getInstance(Locale desiredLocale)
. The Collator.compare(String source, String target)
method is used as you would for a Comparator<String>
.
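A minimal sketch of collator-based comparison, contrasted with the raw String ordering (class name illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorCompare {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        // The collator orders "apple" before "Zanzibar", as a user would expect
        assert collator.compare("apple", "Zanzibar") < 0;
        // The raw String ordering does the opposite, because 'Z' < 'a' by code point
        assert "Zanzibar".compareTo("apple") < 0;
    }
}
```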
Collator Strength
Similar to the comparison levels of UTS #10, Java provides four collator strengths, which can be set using Collator.setStrength(int newStrength)
. The first three correspond roughly to the Unicode L1
, L2
, and L3
comparison levels.
Collator.PRIMARY
- Considers only the base character, and ignores accents and case.
Collator.SECONDARY
- Considers the base character with any accents; case is ignored.
Collator.TERTIARY
- Considers the base character, accents, and other variations such as case. This is the default strength.
Collator.IDENTICAL
- The characters will only be considered equal if they are the exact same character.
PRIMARY
and SECONDARY
strengths are both case-insensitive, but only PRIMARY
is insensitive to accents. The most permissive is therefore PRIMARY
, while the most strict is IDENTICAL
. As an example, consider the word cafe
, which sometimes appears as café
to reflect the French spelling. Using PRIMARY
collator strength, the words cafe
, CAFE
, and café
would all be considered the same word. Using SECONDARY
collator strength, cafe
and café
would be considered different words, but cafe
and CAFE
would be considered the same. Finally TERTIARY
collator strength would consider all three forms as distinct, receiving different orderings. (The IDENTICAL
strength would consider the three forms distinct as well, and would take into account any other differences that might appear beyond base characters, accents, and case.)
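These strength behaviors can be sketched with a US English collator; the exact equalities shown assume the default en/US collation rules:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengths {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);

        collator.setStrength(Collator.PRIMARY);
        assert collator.equals("cafe", "CAFE");       // case ignored
        assert collator.equals("cafe", "caf\u00E9");  // accents ignored too

        collator.setStrength(Collator.SECONDARY);
        assert collator.equals("cafe", "CAFE");       // case still ignored
        assert !collator.equals("cafe", "caf\u00E9"); // but accents now matter
    }
}
```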
Collator Decomposition
As you learned in the above sections, comparison of character sequences will not be valid unless the code points have first been normalized. For collation this is best done by first decomposing characters into their base characters, accents, and other marks, using Normalization Form D
(NFD
) or Normalization Form KD
(NFKD
), above. By default a Collator performs no decomposition! If you want normalization before comparison, you must set the decomposition level using Collator.setDecomposition(int decompositionMode)
. The Collator.CANONICAL_DECOMPOSITION
level corresponds to Normalization Form D
(NFD
) and should be your first choice for the collator decomposition setting.
Collator Comparison
To illustrate the use of a collator, suppose you want to sort a list of strings without regard to case or accents. You want to normalize the strings in case some of the strings use different composition forms. You would create and configure a collator as in the following example.
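One possible configuration, sketched here with an illustrative helper method, sorts while ignoring case and accents and normalizing composition forms:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CollatorSort {
    // Illustrative helper: returns a copy of the given list sorted without
    // regard to case or accents, under US English collation rules
    static List<String> sortIgnoringCaseAndAccents(List<String> words) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY);                      // ignore case and accents
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION); // normalize composition forms
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator implements Comparator
        return sorted;
    }

    public static void main(String[] args) {
        // touché appears once precomposed (U+00E9) and once decomposed (e + U+0301)
        List<String> sorted = sortIgnoringCaseAndAccents(
                List.of("zebra", "touche\u0301", "Apple", "touch\u00E9"));
        assert sorted.get(0).equals("Apple"); // Apple first despite its capital letter
    }
}
```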
Review
Summary
- Unicode code points for supplementary characters are encoded in Java character sequences using surrogate pairs of character values.
- Some Unicode characters can be stored in precomposed form, or in a decomposed form consisting of a base character followed by combining marks.
- Internationalized text must use a
Collator
for comparison and sorting, rather than the traditional String
methods, to account for the various Unicode forms and rules.
Gotchas
- Assuming that each
char
in a CharSequence
represents a Unicode character is incorrect, and will yield erroneous results for characters beyond the Basic Multilingual Plane. - Don't use the comparison and equality methods of
java.lang.String
, because they don't take into account internationalization issues and will not work correctly across locales. - If you don't set the decomposition level before using a
Collator
, strings with different mixes of precomposed characters will not be sorted correctly.
In the Real World
- Use
CharSequence
rather than String
or StringBuilder
for the parameters of methods that process sequences of characters. - You must take into account that each
char
in a CharSequence
is potentially part of a surrogate pair or your results will not be correct for supplementary characters. - Use the stream returned by
CharSequence.codePoints()
to skip the need to check for surrogates altogether.
Think About It
- Do the strings you are sorting or otherwise comparing represent text from some language or locale? If so you should be creating and configuring a collator for the locale, rather than comparing using
String
methods.
Self Evaluation
- What is the difference between internationalization and localization?
- How is a Java
Locale
related to a language tag
? - What is the difference between
CharSequence.chars()
and CharSequence.codePoints()
, and why would you want to use one rather than the other? - Which normalization form does the W3C recommend for text published on web sites? How does this form differ from other normalization forms?
- Why is it necessary to use a
Collator
rather than the String
comparison and equality methods for internationalization text?
Task
Your Booker application compares locale-sensitive information in various places, such as when sorting publications by title and looking them up by title. You now know that the approach used so far using String
comparison methods is incorrect and can yield erroneous results, such as sorting in an incorrect order or failing to find a publication by its title.
Convert your publication title sorting and lookup logic to correctly take internationalization issues into account. Sorting should be performed without regard to case or diacritics. Similarly, lookup based on title should work without regard to whether the user's search string contains capital letters or accents. Create unit tests showing that the new sorting and comparison works, using character sequences for testing that would fail had the traditional String
comparison methods been used.
See Also
- Internationalization and localization (Wikipedia)
- Trail: Internationalization (Oracle - The Java™ Tutorials)
- Lesson: Working with Text (Oracle - The Java™ Tutorials)
- Normalizing Text (Oracle - The Java™ Tutorials)
- Comparing Strings (Oracle - The Java™ Tutorials)
- Unicode surrogate programming with the Java language (Masahiko Maedera, IBM developerWorks) Not updated to latest Java API.
- Java Internationalization (Andy Deitsch, David Czarnecki - O'Reilly, 2001)
References
- The Unicode Standard (Unicode Consortium)
- UAX #9: Unicode Bidirectional Algorithm
- UAX #15: Unicode Normalization Forms
- UAX #44: Unicode Character Database
- UTS #10: Unicode Collation Algorithm
- Unicode FAQ: Normalization (Unicode Consortium)
Resources
- Internet Engineering Task Force (IETF)
- W3C Internationalization (i18n) Activity
- IANA Language Subtag Registry
- Unicode Character Database
- International Components for Unicode (ICU)
Acknowledgments
- Image of Unicode code point
U+10014
Linear B syllable MA
by Ch1902 [Public domain], via Wikimedia Commons. - Image of Unicode emoji character
U+1F91E
HAND WITH INDEX AND MIDDLE FINGERS CROSSED
from EmojiOne licensed under CC-BY 4.0. - The use of the words
cafe
and café
as collator examples was inspired by Java™ Internationalization by Andrew Deitsch and David Czarnecki (O'Reilly, 2001). - Some symbols are from Font Awesome by Dave Gandy.