Charsets
Goals
- Understand text and characters.
- Distinguish between character sets and character encodings.
- Work with standard charsets.
Concepts
- American Standard Code for Information Interchange (ASCII)
- American National Standards Institute (ANSI)
- Basic Multilingual Plane (BMP)
- big-endian
- code page
- code point
- byte order
- byte order mark (BOM)
- character
- character encoding
- character set
- coded character set
- control character
- endianness
- International Organization for Standardization (ISO)
- Internet Assigned Numbers Authority (IANA)
- little-endian
- signature
- text
- Unicode
- Unicode block
- Unicode Consortium
- Unicode plane
- Universal Coded Character Set (UCS)
- variable-length encoding
Library
java.lang.System.getProperty(String key)
java.io.InputStream
java.io.OutputStream
java.nio.charset.Charset
java.nio.charset.Charset.forName(String charsetName)
java.nio.charset.Charset.name()
java.nio.charset.StandardCharsets
java.nio.charset.StandardCharsets.US_ASCII
java.nio.charset.StandardCharsets.UTF_8
Lesson
Storing and working with written information should be easy. It's just working with a lot of letters, right? But it's not easy; there are many details and complications to handling the symbols that make up human languages, and most developers get it wrong.
First of all, when we deal with text we are dealing with more than letters; we are dealing with other symbols such as punctuation as well. And we must take care to handle letters in other languages, of which there are a great many. But how do we represent them? Unfortunately, over the years different groups of people have used different approaches to representing these things, usually resorting to simplistic ways of storing them in files. We have to untangle the whole mess, or we'll have yet another program that can display Hello World! but has a problem handling even English words that come from other languages.
To illustrate a bit of the complication, consider a file containing the following bytes:
0xEF | 0xBB | 0xBF | 0x74 | 0x6F | 0x75 | 0x63 | 0x68 | 0xC3 | 0xA9 |
What do these bytes mean? Maybe the numbers represent some letters. But which letters? And what numbers are we dealing with, anyway: are these eight-bit numbers, or are these 16-bit numbers (with each number taking two bytes)?
Characters
To start attacking the problem, we have to first determine what we're dealing with. We're going to call the basic unit of text a character. This could be the letter 'A' or the asterisk * character. Even the spaces between these words could be considered characters.
Character Sets
A character set is simply an identified group of characters. If we assign some code or number to each character, then we can represent them on a computer; the set of characters then becomes a coded character set. But what codes should we assign to the characters? As you might have guessed, others have made proposals on what codes to use. Here are a few interesting ones through history.
ASCII
One of the most famous coded character sets is the American Standard Code for Information Interchange, or ASCII (pronounced ass-kee). It was made in America, and it's as if no one at the time thought anyone other than Americans would ever use computers: it only supports the 26 letters of the English alphabet (in uppercase and lowercase) and some punctuation. In all it maps out 128 codes (ending at 0x7F) to represent characters, some of which are control characters representing non-displayable actions (such as BS, a backspace).
row+col | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 | 0x08 | 0x09 | 0x0A | 0x0B | 0x0C | 0x0D | 0x0E | 0x0F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0x00 | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
0x10 | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
0x20 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
0x30 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
0x40 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
0x50 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
0x60 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
0x70 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
The good news is that with 128 combinations, ASCII only uses seven bits of information, which can fit in a single byte. The bad news is that with only 128 combinations, what it can represent is extremely limited. There is no c-cedilla ç character, for example, so we can't even represent the word "façade".
The official name given by the Internet Assigned Numbers Authority (IANA) is US-ASCII.
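You can see ASCII's limits directly from Java. A minimal sketch (the class name is just for illustration): encoding a word containing ç with US-ASCII forces the encoder to substitute its replacement byte.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDemo {
    public static void main(String[] args) {
        // ç has no US-ASCII code, so the encoder substitutes
        // the charset's replacement byte ('?').
        byte[] bytes = "façade".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // fa?ade
    }
}
```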
ISO-8859-1
The International Organization for Standardization (ISO) came up with a larger set of codes called ISO/IEC 8859-1, which IANA officially refers to as ISO-8859-1. This set comprises 256 different codes, each of which takes up eight bits instead of ASCII's seven, but each of which can still fit in a single byte.
row+col | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 | 0x08 | 0x09 | 0x0A | 0x0B | 0x0C | 0x0D | 0x0E | 0x0F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0x00 | ||||||||||||||||
0x10 | ||||||||||||||||
0x20 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
0x30 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
0x40 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
0x50 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
0x60 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
0x70 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | |
0x80 | ||||||||||||||||
0x90 | ||||||||||||||||
0xA0 | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
0xB0 | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
0xC0 | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
0xD0 | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
0xE0 | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
0xF0 | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
Windows-1252
The Microsoft Windows operating system, depending on the country for which it is installed, uses a code page (essentially a coded character set) to represent characters. In the United States Windows uses CP-1252, or simply Windows-1252, a set of 256 codes that are almost exactly the same as ISO-8859-1. Almost, but not quite: note, for example, the addition of the euro € character at code 0x80, which does not exist in ISO-8859-1.
row+col | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 | 0x06 | 0x07 | 0x08 | 0x09 | 0x0A | 0x0B | 0x0C | 0x0D | 0x0E | 0x0F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0x00 | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
0x10 | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
0x20 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
0x30 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
0x40 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
0x50 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
0x60 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
0x70 | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
0x80 | € | | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | | Ž | |
0x90 | | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ | | ž | Ÿ |
0xA0 | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
0xB0 | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
0xC0 | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
0xD0 | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
0xE0 | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
0xF0 | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
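You can observe the difference between the two code pages from Java. A minimal sketch: decode the single byte 0x80 under each charset (note that "windows-1252" ships with typical JDKs, though the JVM specification only guarantees six standard charsets).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {
    public static void main(String[] args) {
        byte[] data = { (byte) 0x80 };
        // In Windows-1252, 0x80 is the euro sign.
        System.out.println(new String(data, Charset.forName("windows-1252"))); // €
        // In ISO-8859-1, the same byte is the invisible C1 control U+0080.
        System.out.println(new String(data, StandardCharsets.ISO_8859_1).codePointAt(0)); // 128
    }
}
```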
Unicode
For years ISO has been working with the Unicode Consortium to produce ISO/IEC 10646, the Universal Coded Character Set (UCS). This character set is meant to assign codes to all known characters used by all human languages. The Unicode Consortium produces The Unicode Standard, often referred to simply as Unicode, which contains the same code mappings as ISO/IEC 10646 with the addition of rules about how to use and manipulate those codes in applications.
Each code in the UCS is referred to as a code point. Unicode code points range from 0x00 to 0x10FFFF and are divided into 17 different Unicode planes numbered 0–16. The most common plane, covering code points 0x0000–0xFFFF, is called the Basic Multilingual Plane (BMP). Code points in the BMP are frequently represented as U+XXXX, where XXXX is the hexadecimal representation of the code point.
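Java works with code points directly. A small illustrative sketch (the class name is hypothetical) printing each code point of touché in U+XXXX notation:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // Stream the code points of the string and format each one.
        "touché".codePoints()
                .forEach(cp -> System.out.printf("U+%04X (%c)%n", cp, cp));
        // U+0074 (t) … U+00E9 (é)
    }
}
```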
Each plane is subdivided into ranges of related code points, each called a Unicode block. These include Basic Latin (U+0000–U+007F), coding the ASCII characters; Devanagari (U+0900–U+097F), coding characters used in Hindi; Hiragana (U+3040–U+309F), coding characters for one component of the Japanese writing system; and CJK Unified Ideographs (U+4E00–U+9FFF), a very large block containing a unified coding of the characters used by Chinese, Japanese, and Korean.
The Unicode Standard not only defines which characters are assigned which code points, it also provides a database of extensive properties of the identified characters. Each character has an official Unicode name, as well as a General Category. Just a few examples of categories include Ll (Letter, lowercase) for lowercase letters; Nd (Number, decimal digit) for numeric digits in various languages; Pd (Punctuation, dash) for different types of dashes with different meanings; Sm (Symbol, math) for mathematical symbols such as operators; and Zs (Separator, space) for many types of spaces, such as the nonbreaking space. Other Unicode character properties indicate such things as letter casing, script association, and how a character is rendered in various writing systems.
Character Encodings
Many computer systems have either fully embraced Unicode or are in the process of switching. Microsoft Windows uses Unicode, and Java uses Unicode exclusively to represent characters and strings. So choosing a coded character set for a modern application is simple: use Unicode.
Now that you've determined which codes represent which characters, if you store those codes somewhere (such as reading them from a java.io.InputStream or writing them to a java.io.OutputStream) you'll need some way to convert those codes into individual bytes. The approach used to encode character codes into a byte stream is called a character encoding, and there are several of those as well.
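Java applies a character encoding whenever it converts a String to bytes. This minimal sketch (class and helper names are illustrative) previews the encodings examined below by printing the bytes of touché under two charsets:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    static void printHex(byte[] bytes) {
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF); // mask to print an unsigned byte
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // The same six characters produce different byte streams under different encodings.
        printHex("touché".getBytes(StandardCharsets.ISO_8859_1)); // 74 6F 75 63 68 E9
        printHex("touché".getBytes(StandardCharsets.UTF_8));      // 74 6F 75 63 68 C3 A9
    }
}
```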
One Byte per Character Code
The simplest character encoding is to place each code into a single byte. With ASCII this is simple, as we only need seven bits to store each ASCII code. Even ISO-8859-1 never uses more than eight bits for any character code. We could therefore encode the word touché, using ISO-8859-1, with the one-byte-per-code encoding shown in the figure.
touché encoded using the ISO-8859-1 character set:
0x74 | 0x6F | 0x75 | 0x63 | 0x68 | 0xE9 |
t | o | u | c | h | é |
Two Bytes per Character Code
The Universal Coded Character Set used by Unicode contains many, many more codes than can be represented by a single byte. We could therefore decide to use two bytes to represent each code. With this approach we are essentially using 16 bits to represent each code—the equivalent of a Java short type. Stored with the high-order bits first, our encoding of the word touché using two bytes per Unicode code point would look like this:
touché encoded using two bytes per code point, high-order byte first (big-endian):
0x00 | 0x74 | 0x00 | 0x6F | 0x00 | 0x75 | 0x00 | 0x63 | 0x00 | 0x68 | 0x00 | 0xE9 |
t | o | u | c | h | é |
Using two bytes to represent each code brings up another question: which of those two bytes should be placed first in the stream? The above example shows, for each pair of bytes, the first byte in the stream as the one containing the high-order bits (which in this case are each 0x00 because the code values are low). This may in fact seem logical; if we write the decimal number 555, for example, the higher-order digits (e.g. the one representing 500) come before the lower-order digits (e.g. the one representing 50). But some platforms have traditionally placed the byte containing the low-order bits first in memory.
touché encoded using two bytes per code point, low-order byte first (little-endian):
0x74 | 0x00 | 0x6F | 0x00 | 0x75 | 0x00 | 0x63 | 0x00 | 0x68 | 0x00 | 0xE9 | 0x00 |
t | o | u | c | h | é |
We call these two approaches the byte order, and we have two names for them based upon which "end" of the value comes first. If the "big" part of the value comes first, we call it big-endian byte order; if the "little" part of the value comes first, we call it little-endian byte order. The byte order is therefore sometimes referred to as endianness.
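In Java you can observe byte order directly with java.nio.ByteBuffer, which lets you choose the endianness used for multi-byte values. A minimal sketch (the class name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        // Store the 16-bit value for 'é' (0x00E9) in both byte orders.
        ByteBuffer big = ByteBuffer.allocate(2).order(ByteOrder.BIG_ENDIAN);
        big.putChar('é');
        ByteBuffer little = ByteBuffer.allocate(2).order(ByteOrder.LITTLE_ENDIAN);
        little.putChar('é');
        System.out.printf("big-endian:    %02X %02X%n",
                big.get(0) & 0xFF, big.get(1) & 0xFF);       // 00 E9
        System.out.printf("little-endian: %02X %02X%n",
                little.get(0) & 0xFF, little.get(1) & 0xFF); // E9 00
    }
}
```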
UTF-8
So far using two bytes to encode each character code works fine, but you might have noticed a problem: we've doubled the amount of storage space we need! For English and even general Latin-based alphabets, most of the time we don't need more than a single byte for each character code. It's only in the rare instances in which we need to represent non-ASCII characters that we are forced to use more than one byte.
The UTF-8 encoding was invented to solve this problem. It follows a slightly more complicated set of rules, but supports all Unicode code points:
- If the code point is less than 0x80 (i.e. if it is a US-ASCII character code), use one byte to represent the character.
- If the code point is 0x80 or above (including ISO-8859-1 codes above the ASCII range), use two, three, or four bytes to encode the code point using the UTF-8 algorithm.
touché, represented using the Universal Coded Character Set of Unicode, encoded in UTF-8:
0x74 | 0x6F | 0x75 | 0x63 | 0x68 | 0xC3 | 0xA9 |
t | o | u | c | h | é |
Thus the character A (U+0041) would be encoded as the single byte 0x41, while the character é (U+00E9) would be encoded as the two bytes 0xC3 and 0xA9. The Hindi letter म (U+092E) would be encoded as three bytes: 0xE0 0xA4 0xAE.
Revisiting the word touché, still representing the characters in Unicode but encoding them in UTF-8 produces the series of bytes in the figure above. The table below provides more in-depth examples of the distribution of the bits of several code points across their multi-byte encodings in UTF-8.
Character | Code Point | Binary code point | Binary UTF-8 | Hexadecimal UTF-8 |
---|---|---|---|---|
$ | U+0024 | 010 0100 | 00100100 | 0x24 |
¢ | U+00A2 | 000 1010 0010 | 11000010 10100010 | 0xC2 0xA2 |
€ | U+20AC | 0010 0000 1010 1100 | 11100010 10000010 10101100 | 0xE2 0x82 0xAC |
𐍈 | U+10348 | 0 0001 0000 0011 0100 1000 | 11110000 10010000 10001101 10001000 | 0xF0 0x90 0x8D 0x88 |
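For the curious, the bit shuffling in the table can be written out in a few lines of Java. This is an illustrative sketch of the UTF-8 algorithm only (it performs no validation of surrogates or range, and it is not the encoder you'd use in practice; Java's Charset machinery already implements UTF-8):

```java
public class Utf8Sketch {

    /** Encodes a single code point per the bit distribution in the table above. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {           // 1 byte:  0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {   // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | cp >> 6),
                    (byte) (0x80 | cp & 0x3F) };
        } else if (cp < 0x10000) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | cp >> 12),
                    (byte) (0x80 | cp >> 6 & 0x3F),
                    (byte) (0x80 | cp & 0x3F) };
        } else {                   // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xF0 | cp >> 18),
                    (byte) (0x80 | cp >> 12 & 0x3F),
                    (byte) (0x80 | cp >> 6 & 0x3F),
                    (byte) (0x80 | cp & 0x3F) };
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x20AC)) { // the euro sign €
            System.out.printf("%02X ", b & 0xFF); // E2 82 AC
        }
    }
}
```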
UTF-16
UTF-8 is a very efficient encoding scheme for text that is likely to be made up primarily of English and other Latin-script words. But if you know that much of your text will be using high Unicode code point values, you might as well use two bytes for each character. The UTF-16 encoding is very similar to the two-bytes-per-character-code encoding above, except that it is also a variable-length encoding scheme that will use more than two bytes in some situations (namely, for code points outside the BMP, which are encoded as a surrogate pair).
Like the two-bytes-per-code encoding above, UTF-16 will use at least two bytes to represent each code. UTF-16 similarly has two possible byte orders. Big-endian UTF-16 is referred to as UTF-16BE, and little-endian UTF-16 is referred to as UTF-16LE.
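A short illustrative sketch showing a surrogate pair in action (the class name is hypothetical; StandardCharsets.UTF_16BE is part of the standard library):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        // 𐍈 (U+10348) lies outside the BMP, so UTF-16 must encode it
        // as a surrogate pair: four bytes rather than two.
        String hwair = new String(Character.toChars(0x10348));
        for (byte b : hwair.getBytes(StandardCharsets.UTF_16BE)) {
            System.out.printf("%02X ", b & 0xFF); // D8 00 DF 48
        }
    }
}
```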
UTF-32
Unlike UTF-8 and UTF-16, the UTF-32 encoding scheme uses a fixed length encoding of exactly four bytes per code point. Being fixed-length means that UTF-32 code points can be indexed by their position in a stream of bytes. Its use of four bytes makes it a very inefficient encoding for storing long strings of Latin characters, however.
UTF-32 also comes in big-endian and little-endian variations, referred to as UTF-32BE and UTF-32LE, respectively.
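A sketch of the fixed-length indexing property (note that "UTF-32BE" is available on typical JDKs, but it is not among the six charsets the JVM specification guarantees):

```java
import java.nio.charset.Charset;

public class Utf32Demo {
    public static void main(String[] args) {
        byte[] bytes = "touché".getBytes(Charset.forName("UTF-32BE"));
        // Fixed length: character n always starts at byte offset 4 * n.
        int n = 5; // the é
        int cp = ((bytes[4 * n] & 0xFF) << 24) | ((bytes[4 * n + 1] & 0xFF) << 16)
                | ((bytes[4 * n + 2] & 0xFF) << 8) | (bytes[4 * n + 3] & 0xFF);
        System.out.printf("U+%04X%n", cp); // U+00E9
    }
}
```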
Byte Order Marks
Signature / BOM | Character Encoding | Endianness |
---|---|---|
0xEF 0xBB 0xBF | UTF-8 | N/A |
0xFE 0xFF | UTF-16 | BE |
0xFF 0xFE | UTF-16 | LE |
0x00 0x00 0xFE 0xFF | UTF-32 | BE |
0xFF 0xFE 0x00 0x00 | UTF-32 | LE |
So if we have a sequence of bytes, even if we know that it represents the Unicode character set, how do we know which character encoding is being used so that we can extract those Unicode code points? For files in a file system, a standard approach exists for placing a special series of bytes called a signature at the beginning of the byte stream.
For encodings that support endianness, the Unicode byte order mark (BOM) character U+FEFF is used to signal not only the character encoding in use, but also the byte order of the encoding (e.g. big-endian or little-endian). When an application reads a text file starting with a byte order mark, it uses the BOM to determine how the remaining bytes should be interpreted. The BOM itself, however, is not considered part of the file's actual content; the application discards it after reading it.
Adding a BOM to our UTF-8 encoded Unicode characters touché now gives us the bytes that were presented in the lesson's introduction.
touché with a UTF-8 signature:
0xEF | 0xBB | 0xBF | 0x74 | 0x6F | 0x75 | 0x63 | 0x68 | 0xC3 | 0xA9 |
UTF-8 BOM | t | o | u | c | h | é |
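Putting the signature table to work, here is a minimal sketch of BOM detection for a stream. It is a simplified example, not a complete implementation: it handles only the UTF-8 and UTF-16 signatures, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /**
     * Detects a UTF-8 or UTF-16 signature at the start of a stream, consuming it
     * if present and pushing the bytes back if not. The stream must have been
     * created with a pushback capacity of at least 3, e.g.
     * new PushbackInputStream(in, 3). A fuller version would also check the
     * four-byte UTF-32 signatures (0xFF 0xFE is a prefix of the UTF-32LE one).
     */
    static Charset detectBom(PushbackInputStream in, Charset defaultCharset) throws IOException {
        byte[] bom = new byte[3];
        int count = in.readNBytes(bom, 0, 3);
        if (count == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8; // signature consumed; content follows
        }
        if (count >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            if (count == 3) in.unread(bom[2]); // third byte is content, not BOM
            return StandardCharsets.UTF_16BE;
        }
        if (count >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            if (count == 3) in.unread(bom[2]);
            return StandardCharsets.UTF_16LE;
        }
        if (count > 0) in.unread(bom, 0, count); // no BOM: push everything back
        return defaultCharset;
    }
}
```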
Charsets
You've therefore learned that to store text in a file, you need to know both the character set (the codes) you're using, as well as the encoding scheme (how to convert those codes to bytes), including the byte order. Thus UTF-16BE indicates a charset consisting of 1) the Unicode character set, 2) the UTF-16 character encoding, and 3) the big-endian byte order.
Java has a class java.nio.charset.Charset to represent a charset. You can ask for a Charset instance using the Charset.forName(String charsetName) static factory method, passing in the charset name. All JVMs are required to support the charsets identified by the following names: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
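For example (a minimal sketch; the last line shows the mojibake that results from decoding bytes with the wrong charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset charset = Charset.forName("UTF-8"); // look up by name
        System.out.println(charset.name());                     // UTF-8
        System.out.println(charset.equals(StandardCharsets.UTF_8)); // true

        // Decoding bytes requires knowing which charset produced them.
        byte[] bytes = { 0x74, 0x6F, 0x75, 0x63, 0x68, (byte) 0xC3, (byte) 0xA9 };
        System.out.println(new String(bytes, StandardCharsets.UTF_8));      // touché
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // touchÃ©
    }
}
```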
Review
Summary
In order to store text in a file, you need to know:
- The character set (the codes) you're using. (Recommended: Unicode)
- The character encoding scheme, or how to convert those character codes to bytes (Recommended: UTF-8)
- The byte order of the encoding scheme.
All of this information is encapsulated by a named charset.
Gotchas
- US-ASCII only represents 128 codes. If you have a character code of 128 or above, it isn't ASCII!
- There is no such thing as the "ANSI character set", although many people incorrectly use that terminology to refer to Windows-1252.
- Don't use the Windows-1252 character set; it is different from ISO-8859-1 and incompatible with Unicode.
- UTF-8 is a variable-length encoding; only character codes within the ASCII range will be encoded as single bytes.
- If you detect a BOM when processing a text file, don't include the BOM in the actual text content; discard it after determining the encoding and byte order.
In the Real World
- For new programs, use the Unicode character set and store your text encoded in UTF-8 unless you have a reason to do otherwise.
- Many legacy Java APIs still use strings to represent charsets by name, but you should use Charset instances whenever possible.
Self Evaluation
- What is the difference between Unicode and UTF-8?
- What does ANSI refer to?
- Why is UTF-8 so efficient for storing English words?
Task
Add the capability to the Booker application to print the name of the application user, loaded from a configuration file.
- Create a file named user.txt in the .booker subdirectory of the user's home directory, using a text editor that understands UTF-8 or a hex editor. Make sure the file has the UTF-8 BOM at the beginning. Provide some name (using only ASCII characters) in this file.
- Create a utility class called IO with a utility method readText(…) that takes a Path and returns the string contents of that path. (One possible sketch appears after this list.)
  - The method should throw an IOException if there are any problems reading the file.
  - First check for the UTF-8 BOM. If it is there, read it and discard it. If it is not present, assume that the file is stored in UTF-8. Whether this assumption is valid depends on the context! Many modern formats default to UTF-8, and here you control the file in question, so you know what its encoding should be. But be sure to research whether a file you are reading can be assumed to contain UTF-8 if it has no signature indicating the charset.
  - You don't want to implement the entire UTF-8 algorithm yet, so instead take a shortcut: you know that in a UTF-8 file, the only multi-byte encodings will be those for Unicode code points that lie beyond the ASCII range. Put another way, if a UTF-8 file contains only ASCII characters, the file will never use more than one byte per character. So read all the characters and put them in a string to return. Check each byte; if any byte indicates a multi-byte encoding, throw an IOException saying that the UTF-8 algorithm hasn't been completely implemented yet. (Make sure all this is documented in the API documentation.)
  - To test this, you'll want to have two readText(…) methods: one that takes a path and another that takes an input stream; the first will delegate to the second. Create unit tests that create input streams based upon byte arrays. Test a simple array of bytes representing ASCII characters. Test byte arrays with and without a BOM. Test a byte array with a BOM containing the UTF-8 encoding of the word touché. Some of these tests will expect exceptions to be thrown.
- When the Booker application runs, check to see if there is a user.txt file in the configuration directory.
  - If there is no such file, generate one containing the current system user account name. You can retrieve the name of the current logged-in system user using System.getProperty(String key) with the key "user.name".
  - If that file exists, open and read it to retrieve the name of the user.
- When listing files, print the user's name to the console (e.g. Library of Jane Doe) before continuing.
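If you want a rough idea of where the readText(…) pair might end up, here is one possible sketch under the assumptions above (UTF-8 default, ASCII-only shortcut). Treat it as a starting point to compare against, not the definitive solution.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** One possible sketch of the IO utility described in the task. */
public final class IO {

    public static String readText(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return readText(in); // delegate to the stream-based variant
        }
    }

    public static String readText(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int b = in.read();
        if (b == 0xEF) { // possible UTF-8 BOM; expect 0xBB 0xBF to follow
            if (in.read() != 0xBB || in.read() != 0xBF) {
                throw new IOException("Unexpected bytes after 0xEF; not a UTF-8 BOM.");
            }
            b = in.read(); // discard the BOM and continue with the content
        }
        while (b != -1) {
            if (b >= 0x80) { // start of a multi-byte UTF-8 sequence
                throw new IOException("The UTF-8 algorithm hasn't been completely implemented yet.");
            }
            text.append((char) b); // ASCII maps directly to the same code point
            b = in.read();
        }
        return text.toString();
    }
}
```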
See Also
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- What every programmer absolutely, positively needs to know about encodings and character sets to work with text
References
- ASCII (Wikipedia)
- ISO/IEC 8859-1 (Wikipedia)
- Windows-1252 (Wikipedia)
- Character Sets (IANA)
- The Unicode Standard (Unicode Consortium)
- UTF-8 (Wikipedia)
- UTF-16 (Wikipedia)
- UTF-32 (Wikipedia)
- Supported Encodings (Oracle - Tech Notes: Guides)
- Unicode Byte Order Mark FAQ
Resources
- American National Standards Institute (ANSI)
- International Organization for Standardization (ISO)
- Internet Assigned Numbers Authority (IANA)
- Unicode Consortium
- BabelMap Online (BabelStone)
- UniView (r12a)
- Unicode character pickers (r12a)
- Unicode code converter (r12a)
Acknowledgments
- UTF-8 encoding examples table modified from UTF-8 (Wikipedia).
- Some symbols are from Font Awesome by Dave Gandy.