Text I/O

Goals

Understand Java's system of readers and writers.
Learn how to work with byte order marks with readers and writers.

Concepts

adapter pattern
decorator pattern
malformed
reader
unmappable
writer

Library

Dependencies

com.globalmentor:globalmentor-core:0.5.4

Lesson

The basis of file I/O in Java, as you've learned, is formed by the byte-based InputStreams and OutputStreams. But human-readable text, based upon characters represented by the Java char primitive type, must be translated to a series of bytes using a charset, which includes the concepts of a character set, a character encoding, and a byte order. As you've experienced from the tasks in previous lessons, manually extracting bytes and converting them to characters using the correct character encoding algorithm can be tedious, not to mention complicated.

Java provides a set of reader and writer I/O classes for translating between byte and char automatically based upon a charset. The java.io.Reader class is for decoding bytes from an underlying InputStream into characters, and the java.io.Writer class is for encoding characters as bytes to an underlying OutputStream.

Readers

The following reader classes are all in the java.io package.

`Reader` class diagram.

Reader: Abstract class that forms the basis of all readers.
BufferedReader: Provides buffering of other readers.
CharArrayReader: A reader to an existing array of characters.
FileReader: A direct reader to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterReader: A simple reader wrapper allowing subclasses to do more processing on data after reading.
InputStreamReader: Central reader for wrapping an input stream and decoding bytes to characters.
LineNumberReader: A reader that keeps track of line numbers.
StringReader: A reader to the characters of an existing string.

Writers

The following writer classes are all in the java.io package.

`Writer` class diagram.

Writer: Abstract class that forms the basis of all writers.
BufferedWriter: Provides buffering of other writers.
CharArrayWriter: A writer to a dynamically managed internal array of characters.
FileWriter: A writer to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterWriter: A simple writer wrapper allowing subclasses to do more processing on data before writing.
OutputStreamWriter: Central writer for wrapping an output stream and encoding characters to bytes.
StringWriter: A writer to an internal string buffer, which can later be used to produce a string.

Reading and Writing `char`

The biggest distinction between readers and writers on the one hand, and input and output streams on the other, is that the former group works in terms of characters rather than byte values. A single character can be read using Reader.read(); this method returns an int, just as InputStream.read() does, and both use the value -1 to indicate the end of the stream. However, the range of values of InputStream.read() is that of eight bits of information: 0x00 – 0xFF. The range of values returned by Reader.read() is that of 16 bits of information (i.e. the range of a char): 0x0000 – 0xFFFF.
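As a minimal sketch of this difference in ranges, the following compares the first value read from a byte stream with the first value read from a reader over the same text. The class and method names (ReadRangeDemo, firstByte, firstChar) are made up for illustration, and the sample character U+092E is simply one that lies above 0xFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

final class ReadRangeDemo {

  /** Reads the first value from a byte stream: 0x00-0xFF, or -1 at end of stream. */
  static int firstByte(final String text) throws IOException {
    try (final InputStream inputStream = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))) {
      return inputStream.read();
    }
  }

  /** Reads the first value from a reader: 0x0000-0xFFFF, or -1 at end of stream. */
  static int firstChar(final String text) throws IOException {
    try (final Reader reader = new StringReader(text)) {
      return reader.read();
    }
  }

  public static void main(final String[] args) throws IOException {
    final String text = "\u092E"; // a sample character above 0xFF (hypothetical test input)
    System.out.println(String.format("first byte: 0x%02X", firstByte(text))); // first byte: 0xE0
    System.out.println(String.format("first char: U+%04X", firstChar(text))); // first char: U+092E
    System.out.println(firstChar("")); // -1
  }
}
```

The byte stream only ever sees one byte of the UTF-8 encoding at a time, while the reader returns the whole char value.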

If the source of the data is already stored in characters, then no conversion to or from bytes has to take place. Here is how to read the characters from a string, for example, using java.io.StringReader:

Reading individual characters from a StringReader using Reader.read().

final String inputString = "touché";
try(final Reader reader = new StringReader(inputString)) {
  int charValue;
  while((charValue = reader.read()) != -1) {  //read the characters
    System.out.println(String.format("U+%04X", charValue));  //U+XXXX
  }
}

Reader is AutoCloseable so we can use it in a try-with-resources statement.

Note the special format string we use with String.format(String format, Object... args) to show the Unicode code point of the character in the form U+XXXX, where XXXX is a four-digit uppercase hex string. Instead of using the typical %s pattern, we use the pattern %X for an uppercase hex string, providing 04 to indicate that the string should be left-zero-padded to make four digits. See Format String Syntax.

Analogous to InputStream.read(byte[] b) there exists Reader.read(char[] cbuf), which reads multiple characters at a time into an existing buffer. Similarly for writing, analogous to OutputStream.write(byte[] b) and OutputStream.write(byte[] b, int off, int len) there exist Writer.write(char[] cbuf) and Writer.write(char[] cbuf, int off, int len), respectively.

Because most of the character sequences you will have are already in String form, methods such as Writer.write(String str) allow you to write strings directly.
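Here is a small sketch of writing strings directly, using a StringWriter as the destination. The class and method names (WriteStringDemo, greet) are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringWriter;

final class WriteStringDemo {

  /** Builds a greeting by writing strings and a single character to a StringWriter. */
  static String greet(final String name) throws IOException {
    try (final StringWriter writer = new StringWriter()) {
      writer.write("Hello, "); // Writer.write(String str)
      writer.write(name);
      writer.write('!'); // Writer.write(int c) writes a single character
      return writer.toString();
    }
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(greet("world")); // Hello, world!
  }
}
```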

You can therefore read and write buffers of characters, and move them between readers and writers similarly to how you would with byte streams.

Copying from a Reader to a Writer using a buffer.

final char[] inputChars = "abcdefghijklmnopqrstuvwxyz".toCharArray();

//create a buffer array for copying up to 16 characters at a time (an arbitrary value)
final char[] buffer = new char[0x10];

//create a destination writer for the characters
final StringWriter stringWriter = new StringWriter();

//copy a buffer at a time until we reach the end of the reader
try {
  try(final Reader reader = new CharArrayReader(inputChars)) {
    int count;
    while((count = reader.read(buffer)) != -1) {  //-1 indicates end of stream
      stringWriter.write(buffer, 0, count);
    }
  }
} finally {
  stringWriter.close();
}

//print out the string of the characters copied to the Writer
System.out.println(stringWriter.toString());

Adapting a Byte Stream

Just as input and output streams can wrap other input and output streams, some readers and writers can wrap other readers and writers; this is the decorator pattern you learned about already. However, the important java.io.InputStreamReader and java.io.OutputStreamWriter classes do not wrap other readers and writers; rather they wrap instances of InputStream and OutputStream, respectively. Calling InputStreamReader.read(), for example, will call the underlying InputStream.read() to read the correct number of bytes, decoding them (according to the specified charset) to the correct character, which is then returned according to the Reader.read() contract. Similarly, OutputStreamWriter.write(int c) will convert the input character to the correct number of bytes and write them to the underlying OutputStream based on the specified charset.

By wrapping InputStream and OutputStream, the InputStreamReader and OutputStreamWriter classes effectively convert the InputStream and OutputStream APIs to the Reader and Writer APIs. Converting the API of one class or interface to another by wrapping the converted class and delegating to it in this way is called the adapter pattern.

In order for InputStreamReader and OutputStreamWriter to convert characters to and from byte streams, they must be provided a charset, usually in the form of a java.nio.charset.Charset instance. Let's take the UTF-8 encoding of the string "touché" introduced in the charsets lesson and read the bytes as characters, letting the InputStreamReader take care of converting the bytes for us:

Reading UTF-8 encoded bytes using an InputStreamReader.

final byte[] inputBytes = new byte[] { 0x74, 0x6F, 0x75, 0x63, 0x68, (byte)0xC3, (byte)0xA9 };
try(final Reader reader = new InputStreamReader(new ByteArrayInputStream(inputBytes), StandardCharsets.UTF_8)) {
  int charValue;
  while((charValue = reader.read()) != -1) {
    System.out.println((char)charValue);
  }
}

The values 0xC3 and 0xA9 are both above 0x7F (decimal 127), the highest value a signed byte can hold. Casting these values to byte causes them to wrap around, so Java will treat them as negative numbers if they are printed out. Still, the negative forms, if you recall two's complement, have the same bit representations as the values 0xC3 and 0xA9, which is the important thing when it comes time for the InputStreamReader to decode them.

Both InputStreamReader and OutputStreamWriter provide constructors that do not require a charset to be indicated. These constructors assume the default character encoding of the system. This is extremely dangerous and should be avoided. Different operating systems and configurations may at any time have different default character encodings, and relying on the default not only prevents your code from running consistently across systems, but likely ignores the actual encoding used by some byte stream, which may be completely independent of the system. Always specify a charset when constructing an InputStreamReader or OutputStreamWriter.
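The writing direction can be sketched in the same way: an OutputStreamWriter, constructed with an explicit charset, encodes characters to bytes in an underlying ByteArrayOutputStream. The class and method names (EncodeDemo, encode) are hypothetical, and the é is written as a Unicode escape so that the example does not depend on the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodeDemo {

  /** Encodes text to bytes by adapting a byte stream to the Writer API with an explicit charset. */
  static byte[] encode(final String text, final Charset charset) throws IOException {
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    try (final Writer writer = new OutputStreamWriter(byteArrayOutputStream, charset)) {
      writer.write(text);
    } // closing the writer flushes the encoded bytes to the underlying stream
    return byteArrayOutputStream.toByteArray();
  }

  public static void main(final String[] args) throws IOException {
    final String text = "touch\u00E9";
    for (final byte b : encode(text, StandardCharsets.UTF_8)) {
      System.out.print(String.format("%02X ", b & 0xFF)); // 74 6F 75 63 68 C3 A9
    }
    System.out.println();
    for (final byte b : encode(text, StandardCharsets.ISO_8859_1)) {
      System.out.print(String.format("%02X ", b & 0xFF)); // 74 6F 75 63 68 E9
    }
    System.out.println();
  }
}
```

The same string produces different byte sequences under different charsets, which is why leaving the charset to the system default is risky.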

InputStreamReader converts bytes to characters using a java.nio.charset.CharsetDecoder for the appropriate charset. What a CharsetDecoder does when the input is malformed (an invalid sequence of bytes for the charset is encountered) or a character is unmappable (does not represent a valid Unicode character) is determined by the specified java.nio.charset.CodingErrorAction. An InputStreamReader by default may use a CharsetDecoder configured with CodingErrorAction.REPLACE, so that malformed input or unmappable characters are simply replaced by some replacement character! If you want to make sure that an exception is thrown if there is malformed input, you'll want to configure your own CharsetDecoder from Charset.newDecoder() to use CodingErrorAction.REPORT so that it will throw a java.nio.charset.CharacterCodingException. You can then use it to create an InputStreamReader using the constructor InputStreamReader(InputStream in, CharsetDecoder dec).

final CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT);

The globalmentor-core library listed in the Dependencies contains a utility method com.globalmentor.io.Charsets.newDecoder(Charset charset, CodingErrorAction codingErrorAction) which makes it easier to create a configured decoder for a particular charset.
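To see a reporting decoder in action, here is a minimal sketch using only the standard library that passes a CharsetDecoder configured with CodingErrorAction.REPORT to an InputStreamReader. The class and method names (StrictDecodeDemo, decodeStrictly) are hypothetical, and the malformed sample relies on the fact that the byte 0xFF never appears in valid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeDemo {

  /** Decodes UTF-8 bytes, throwing a CharacterCodingException instead of substituting characters. */
  static String decodeStrictly(final byte[] bytes) throws IOException {
    final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    final StringBuilder stringBuilder = new StringBuilder();
    try (final Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
      int charValue;
      while ((charValue = reader.read()) != -1) {
        stringBuilder.append((char) charValue);
      }
    }
    return stringBuilder.toString();
  }

  public static void main(final String[] args) throws IOException {
    final byte[] valid = {0x74, 0x6F, 0x75, 0x63, 0x68, (byte) 0xC3, (byte) 0xA9};
    System.out.println(decodeStrictly(valid));
    final byte[] malformed = {0x74, (byte) 0xFF, 0x6F}; // 0xFF is never valid in UTF-8
    try {
      decodeStrictly(malformed);
    } catch (final CharacterCodingException characterCodingException) {
      System.out.println("Malformed input reported: " + characterCodingException);
    }
  }
}
```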

Byte Order Marks

Unfortunately, although Java's readers and writers do a fine job of converting between bytes and characters based upon charsets, they don't handle byte order marks at all. A BOM will be interpreted as bytes to be converted into characters, even though it should really be used to determine the charset and then discarded. Instead you'll need to detect and write any BOM manually.

Even Oracle's Java compiler javac does not auto-detect the source file encoding! javac uses the platform default charset and considers any BOM an illegal character. This is why in order to support UTF-8 in your Java source files, you must specify the -encoding parameter to javac, or set the project.build.sourceEncoding property in the Maven POM as recommended in these lessons. See javac and JDK Bug JDK-4508058: UTF-8 encoding does not recognize initial BOM, which was closed with the resolution Won't Fix.

Remember that InputStreamReader and OutputStreamWriter are adapters of existing InputStream and OutputStream instances; any characters read and written will be translated back and forth to bytes. Because the readers and writers read and write bytes to the wrapped streams using normal calls, there is nothing to stop you from reading or writing bytes to the streams before wrapping them in a reader or writer!

Reading BOMs

One approach for auto-detecting a charset is to read bytes from the underlying InputStream to see if they constitute a BOM before creating the InputStreamReader that wraps around it. The logic might look like this:

Outline for determining a charset by detecting a BOM when creating a Reader.

public class BOMCharsetDetectInputStreamReader extends InputStreamReader {

  public BOMCharsetDetectInputStreamReader(@Nonnull final InputStream inputStream) throws IOException {
    super(inputStream, detectCharset(inputStream));  //detect charset on the fly during construction
  }

  private static Charset detectCharset(final @Nonnull InputStream inputStream) throws IOException {
    //TODO make sure input stream supports mark/reset
    inputStream.mark(4);  //reserve enough room for the largest BOM we support
    final byte[] buffer;
    //TODO read two bytes into the buffer if possible
    if(…) {  //TODO see if it is one of the UTF-16 BOMs; if so
      return …;  //return StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE
    }
    //TODO read another byte into the buffer if possible
    if(…) {  //TODO see if it is the UTF-8 BOM; if so
      return StandardCharsets.UTF_8;
    }
    //TODO read one more byte into the buffer if possible
    if(…) {  //TODO see if it is one of the UTF-32 BOMs; if so
      return …;  //return the correct UTF-32BE or UTF-32LE charset
    }
    //TODO if nothing matches, assume that this was UTF-8 with no BOM
    inputStream.reset();    //don't forget to put the bytes back---they weren't a BOM!
    return StandardCharsets.UTF_8;
  }

}

It's not a good idea to do all that I/O processing in the constructor with the potential of throwing an IOException. A better implementation would be to save the InputStream in the constructor, and wrap an internal Reader initially set to null that would be lazily created. Only on the first read would the BOM be detected and the Reader created to be used for future reads.
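As a rough, runnable sketch of the detection step outlined above, the following static helper checks only for the UTF-8 and UTF-16 BOMs and leaves the stream positioned just past any BOM it finds. The names (BomDetectDemo, detectCharset, decode) are hypothetical, UTF-32 is deliberately left out, and falling back to UTF-8 when no BOM is present is an assumption carried over from the outline.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDetectDemo {

  /** Detects a UTF-8 or UTF-16 BOM; afterwards the stream is just past the BOM, or at the start if none. */
  static Charset detectCharset(final InputStream inputStream) throws IOException {
    if (!inputStream.markSupported()) {
      throw new IllegalArgumentException("Input stream must support mark/reset.");
    }
    final byte[] buffer = new byte[3]; // the longest BOM this sketch recognizes is three bytes
    inputStream.mark(buffer.length);
    int count = 0;
    int read;
    while (count < buffer.length && (read = inputStream.read(buffer, count, buffer.length - count)) != -1) {
      count += read;
    }
    final Charset charset;
    final int bomLength;
    if (count >= 3 && (buffer[0] & 0xFF) == 0xEF && (buffer[1] & 0xFF) == 0xBB && (buffer[2] & 0xFF) == 0xBF) {
      charset = StandardCharsets.UTF_8;
      bomLength = 3;
    } else if (count >= 2 && (buffer[0] & 0xFF) == 0xFE && (buffer[1] & 0xFF) == 0xFF) {
      charset = StandardCharsets.UTF_16BE;
      bomLength = 2;
    } else if (count >= 2 && (buffer[0] & 0xFF) == 0xFF && (buffer[1] & 0xFF) == 0xFE) {
      charset = StandardCharsets.UTF_16LE;
      bomLength = 2;
    } else {
      charset = StandardCharsets.UTF_8; // assumption: no BOM means UTF-8
      bomLength = 0;
    }
    inputStream.reset(); // rewind to the start of the stream
    for (int i = 0; i < bomLength; i++) {
      inputStream.read(); // consume only the BOM bytes
    }
    return charset;
  }

  /** Reads raw bytes to detect the charset first, and only then wraps the stream in a reader. */
  static String decode(final byte[] bytes) throws IOException {
    final InputStream inputStream = new BufferedInputStream(new ByteArrayInputStream(bytes));
    final Charset charset = detectCharset(inputStream);
    final StringBuilder stringBuilder = new StringBuilder();
    try (final Reader reader = new InputStreamReader(inputStream, charset)) {
      int charValue;
      while ((charValue = reader.read()) != -1) {
        stringBuilder.append((char) charValue);
      }
    }
    return stringBuilder.toString();
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(decode(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69})); // hi
    System.out.println(decode(new byte[] {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00, 0x69, 0x00})); // hi
  }
}
```

Keeping the detection in a static helper rather than a constructor sidesteps the concern about doing I/O during construction, though a production implementation would also want a reporting CharsetDecoder.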

The globalmentor-core library listed in the Dependencies contains the com.globalmentor.io.BOMInputStreamReader class which performs auto-detection of charset based on the presence of a BOM. By default BOMInputStreamReader uses a CharsetDecoder configured with CodingErrorAction.REPORT so that an exception will be thrown if there is malformed input or unmappable characters.

Writing BOMs

Similarly, an OutputStreamWriter knows how to translate from characters to the correct byte sequence based on a charset, but it will not prepend the bytes with any byte order mark. A Writer to do this might look like this:

Outline for prepending an output stream with the correct BOM when creating a Writer.

public class BOMOutputStreamWriter extends OutputStreamWriter {

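  public BOMOutputStreamWriter(final @Nonnull OutputStream outputStream,
      final @Nonnull Charset charset) throws IOException {
    super(outputStream, charset);  //the superclass constructor must be invoked first
    outputStream.write(getBOM(charset));  //write the BOM to the underlying output stream before any characters are encoded
  }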

  private static byte[] getBOM(final @Nonnull Charset charset) {
    //TODO return the correct BOM byte sequence for the given charset
  }

}
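One possible way to fill in the BOM lookup is sketched below for UTF-8 and the two explicit UTF-16 byte orders, writing the BOM bytes to the stream before wrapping it in an OutputStreamWriter. The names (BomWriteDemo, getBOM, encodeWithBOM) are hypothetical, and returning an empty array for any other charset is an assumption of this sketch rather than a rule.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomWriteDemo {

  /** Returns the BOM bytes for the charsets this sketch knows about, or an empty array otherwise. */
  static byte[] getBOM(final Charset charset) {
    if (StandardCharsets.UTF_8.equals(charset)) {
      return new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    }
    if (StandardCharsets.UTF_16BE.equals(charset)) {
      return new byte[] {(byte) 0xFE, (byte) 0xFF};
    }
    if (StandardCharsets.UTF_16LE.equals(charset)) {
      return new byte[] {(byte) 0xFF, (byte) 0xFE};
    }
    return new byte[0]; // assumption: no BOM for any other charset
  }

  /** Writes the BOM as raw bytes, then encodes the text through an OutputStreamWriter. */
  static byte[] encodeWithBOM(final String text, final Charset charset) throws IOException {
    final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    byteArrayOutputStream.write(getBOM(charset)); // raw bytes go first, before any characters are encoded
    try (final Writer writer = new OutputStreamWriter(byteArrayOutputStream, charset)) {
      writer.write(text);
    }
    return byteArrayOutputStream.toByteArray();
  }

  public static void main(final String[] args) throws IOException {
    for (final byte b : encodeWithBOM("hi", StandardCharsets.UTF_8)) {
      System.out.print(String.format("%02X ", b & 0xFF)); // EF BB BF 68 69
    }
    System.out.println();
  }
}
```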

Buffered Readers and Writers

Analogous to BufferedInputStream and BufferedOutputStream, Java comes with java.io.BufferedReader and java.io.BufferedWriter classes for wrapping an existing reader or writer. These classes provide buffering at the character level rather than the byte level.

Buffered reading from a text file using a BufferedReader.

final Path path = Paths.get("/etc/foo/bar.txt");
try(final Reader reader = new BufferedReader(new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
  int charValue;
  while((charValue = reader.read()) != -1) {
    System.out.println((char)charValue);
  }
}

You should always use a BufferedReader or BufferedWriter when working directly with a relatively slow data source. But there is no point in adding another buffering layer to a reader or writer that refers to in-memory data, such as a CharArrayReader, as such readers and writers are essentially buffered by definition. Similarly, if the reader or writer is wrapping a byte stream that is already buffered, such as a BufferedInputStream or BufferedOutputStream, there is usually no point in adding yet another layer of buffering at the character I/O level.
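For the writing side, here is a minimal sketch that writes text to a temporary file through a BufferedWriter and reads it back through a BufferedReader, always with an explicit charset. The class and method names (BufferedTextDemo, roundTrip) and the temporary file prefix are made up for this illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class BufferedTextDemo {

  /** Writes text to a temporary file and reads it back, buffering at the character level both ways. */
  static String roundTrip(final String text) throws IOException {
    final Path path = Files.createTempFile("buffered-text-demo", ".txt");
    try {
      try (final Writer writer = new BufferedWriter(
          new OutputStreamWriter(Files.newOutputStream(path), StandardCharsets.UTF_8))) {
        writer.write(text);
      } // closing the outer writer flushes the buffer and closes the file
      final StringBuilder stringBuilder = new StringBuilder();
      try (final Reader reader = new BufferedReader(
          new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
        final char[] buffer = new char[0x10];
        int count;
        while ((count = reader.read(buffer)) != -1) {
          stringBuilder.append(buffer, 0, count);
        }
      }
      return stringBuilder.toString();
    } finally {
      Files.deleteIfExists(path); // clean up the temporary file
    }
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(roundTrip("touch\u00E9"));
  }
}
```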

Review

Gotchas

Using an InputStreamReader or OutputStreamWriter constructor that doesn't specify a charset will use the default system charset, which is completely unknown ahead of time and will cause your code to run differently across systems, likely causing data corruption.
By default InputStreamReader will replace invalid byte sequences and characters rather than reporting an error. If you want to make sure the bytes are valid, you'll need to configure a CharsetDecoder appropriately using CodingErrorAction.REPORT to throw an exception for malformed input or unmappable characters.

In the Real World

You should allow for the possible presence of a BOM in all your code that converts an input stream of bytes to text for processing. If you never encounter a text document starting with a BOM, no harm will be done; but if you neglect to account for a BOM, you will break somebody's workflow down the road.

Think About It

What is the difference in the range of possible values returned by InputStream.read() and Reader.read()?

Self Evaluation

How are readers and writers different from input / output streams?
How would you convert an input or output stream to a reader or a writer?
How do the default readers and writers provided by Java handle byte order marks?

Task

Your existing IO.readText(InputStream) method fakes UTF-8 support by making sure all bytes in the input stream are values within the ASCII range. Upgrade the method to have true UTF-8 support using InputStreamReader.

Make sure the input stream supports mark/reset. Remember that if it doesn't, you can always wrap it in a BufferedInputStream.
Read the first few bytes, and if it is the UTF-8 signature, throw them away. Otherwise, reset the stream so that the actual content bytes aren't lost, and assume the content is encoded in UTF-8.
(optional) If you like you can also check for the various UTF-16 and UTF-32 BOMs. Otherwise, be sure to document that the method only supports UTF-8.
Create a CharsetDecoder for the charset and configure it to use CodingErrorAction.REPORT to report illegal byte sequences or invalid code points.
Create an InputStreamReader with the appropriate charset using the CharsetDecoder.
Read and return the text.

Create unit tests with and without BOMs.
Create unit tests with code points in the ASCII range.
Create unit tests with code points in the ISO-8859-1 range, such as in the string "touché".
Create unit tests with code points over 0xFF, such as the Hindi letter म (ma) which has code point U+092E.

You can use the online Unicode code converter to find the UTF-8 representations of Unicode code points.
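If you would rather stay in Java, a small helper along these lines can print the UTF-8 bytes of a test string for use in your unit tests. The class and method names (Utf8BytesDemo, toHex) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BytesDemo {

  /** Returns the UTF-8 encoding of the text as space-separated, two-digit uppercase hex values. */
  static String toHex(final String text) {
    final StringBuilder stringBuilder = new StringBuilder();
    for (final byte b : text.getBytes(StandardCharsets.UTF_8)) {
      if (stringBuilder.length() > 0) {
        stringBuilder.append(' ');
      }
      stringBuilder.append(String.format("%02X", b & 0xFF)); // mask to get the unsigned byte value
    }
    return stringBuilder.toString();
  }

  public static void main(final String[] args) {
    System.out.println(toHex("touch\u00E9")); // 74 6F 75 63 68 C3 A9
    System.out.println(toHex("\u092E")); // E0 A4 AE
  }
}
```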

Resources

Acknowledgments

Some symbols are from Font Awesome by Dave Gandy.
