Text I/O
Goals
- Understand Java's system of readers and writers.
- Learn how to work with byte order marks with readers and writers.
Concepts
- adapter pattern
- decorator pattern
- malformed
- reader
- unmappable
- writer
Library
java.lang.AutoCloseable
java.lang.String.format(String format, Object... args)
java.io.BufferedReader
java.io.BufferedWriter
java.io.CharArrayReader
java.io.CharArrayWriter
java.io.File
java.io.FileReader
java.io.FileWriter
java.io.FilterReader
java.io.FilterWriter
java.io.InputStreamReader
java.io.InputStreamReader.InputStreamReader(InputStream in, Charset cs)
java.io.InputStreamReader.InputStreamReader(InputStream in, CharsetDecoder dec)
java.io.InputStreamReader.read()
java.io.LineNumberReader
java.io.OutputStreamWriter
java.io.OutputStreamWriter.write(int c)
java.io.Reader
java.io.Reader.read()
java.io.StringReader
java.io.StringWriter
java.io.Writer
java.io.Writer.write(char[] cbuf)
java.io.Writer.write(char[] cbuf, int off, int len)
java.io.Writer.write(String str)
java.nio.charset.Charset
java.nio.charset.Charset.newDecoder()
java.nio.charset.CharsetDecoder
java.nio.charset.CodingErrorAction
java.nio.charset.CodingErrorAction.REPLACE
java.nio.charset.CodingErrorAction.REPORT
java.nio.charset.CharacterCodingException
com.globalmentor.io.BOMInputStreamReader
com.globalmentor.io.Charsets.newDecoder(Charset charset, CodingErrorAction codingErrorAction)
Dependencies
Lesson
The basis of file I/O in Java, as you've learned, is formed by the byte-based InputStreams and OutputStreams. But human-readable text, based upon characters represented by the Java char primitive type, must be translated to a series of bytes using a charset, which includes the concepts of a character set, a character encoding, and a byte order. As you've experienced from the tasks in previous lessons, manually extracting bytes and converting them to characters using the correct character encoding algorithm can be tedious, not to mention complicated.
Java provides a set of reader and writer I/O classes for translating between byte and char automatically based upon a charset. The java.io.Reader class is for decoding bytes from an underlying InputStream into characters, and the java.io.Writer class is for encoding characters as bytes to an underlying OutputStream.
Readers
The following reader classes are all in the java.io package.
Reader
- Abstract class that forms the basis of all readers.
BufferedReader
- Provides buffering of other readers.
CharArrayReader
- A reader to an existing array of characters.
FileReader
- A direct reader to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterReader
- A simple reader wrapper allowing subclasses to do more processing on data after reading.
InputStreamReader
- Central reader for wrapping an input stream and decoding bytes to characters.
LineNumberReader
- A reader that keeps track of line numbers.
StringReader
- A reader to the characters of an existing string.
Writers
The following writer classes are all in the java.io package.
Writer
- Abstract class that forms the basis of all writers.
BufferedWriter
- Provides buffering of other writers.
CharArrayWriter
- A writer to a dynamically managed internal array of characters.
FileWriter
- A writer to a file. This class uses the old java.io.File class and should only be used with legacy code.
FilterWriter
- A simple writer wrapper allowing subclasses to do more processing on data before writing.
OutputStreamWriter
- Central writer for wrapping an output stream and encoding characters to bytes.
StringWriter
- A writer to an internal string buffer, which can later be used to produce a string.
Reading and Writing char
The biggest distinction between readers and writers on the one hand, and input and output streams on the other, is that the former work in terms of characters rather than byte values. A single character can be read using Reader.read(); this method returns an int, just as InputStream.read() does, and both use the value -1 to indicate the end of the stream. However, the range of values returned by InputStream.read() is that of eight bits of information: 0x00 – 0xFF. The range of values returned by Reader.read() is that of 16 bits of information (i.e. the range of a char): 0x0000 – 0xFFFF.
If the source of the data is already stored as characters, then no conversion to or from bytes has to take place. For example, java.io.StringReader reads the characters directly from an existing string.
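As a minimal sketch of reading from a string, the following loops over Reader.read() until it returns -1. The class name StringReaderExample and the helper readAll are hypothetical names chosen only for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example class; not part of the lesson's code base.
class StringReaderExample {

  /** Reads every character of the given string through a Reader, one at a time. */
  static String readAll(String text) throws IOException {
    StringBuilder result = new StringBuilder();
    try (Reader reader = new StringReader(text)) {
      int c; // read() returns an int so that -1 can signal the end of the stream
      while ((c = reader.read()) != -1) {
        result.append((char) c); // any other value fits in a char
      }
    }
    return result.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readAll("touch\u00E9")); // \u00E9 is the character é
  }
}
```

No charset is involved here because the data never exists as bytes.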
Analogous to InputStream.read(byte[] b) there exists Reader.read(char[] cbuf), which reads multiple characters at a time into an existing buffer. Similarly for writing, analogous to OutputStream.write(byte[] b) and OutputStream.write(byte[] b, int off, int len) there exist Writer.write(char[] cbuf) and Writer.write(char[] cbuf, int off, int len), respectively.
You can therefore read and write buffers of characters, and move them between readers and writers similarly to how you would with byte streams.
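Moving characters from a reader to a writer might be sketched as follows, using the same buffer loop you would write for byte streams. The class CharCopyExample, the method copy, and the buffer size of 1024 are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical example class; not part of the lesson's code base.
class CharCopyExample {

  /** Copies all characters from the reader to the writer and returns how many were copied. */
  static long copy(Reader reader, Writer writer) throws IOException {
    char[] buffer = new char[1024]; // arbitrary buffer size for this sketch
    long total = 0;
    int count;
    // read(char[]) returns the number of characters read, or -1 at the end of the stream
    while ((count = reader.read(buffer)) != -1) {
      writer.write(buffer, 0, count); // write only the part of the buffer that was filled
      total += count;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    StringWriter writer = new StringWriter();
    long copied = copy(new StringReader("Hello, readers and writers!"), writer);
    System.out.println(copied + " characters copied: " + writer);
  }
}
```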
Adapting a Byte Stream
Just as input and output streams can wrap other input and output streams, some readers and writers can wrap other readers and writers; this is the decorator pattern you learned about already. However, the important java.io.InputStreamReader and java.io.OutputStreamWriter classes do not wrap other readers and writers; rather they wrap instances of InputStream and OutputStream, respectively. Calling InputStreamReader.read(), for example, will call the underlying InputStream.read() to read the correct number of bytes and decode them (according to the specified charset) to the correct character, which is then returned according to the Reader.read() contract. Similarly OutputStreamWriter.write(int c) will convert the given character to the correct number of bytes, based on the specified charset, and write them to the underlying OutputStream.
In order for InputStreamReader and OutputStreamWriter to convert between characters and bytes, they must be provided a charset, usually in the form of a java.nio.charset.Charset instance. Consider the UTF-8 encoding of the string "touché" introduced in the charsets lesson: wrapping those bytes in an InputStreamReader lets the reader take care of converting them to characters for us.
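A sketch of decoding those bytes could look like the following. It assumes the bytes come from an in-memory ByteArrayInputStream and uses the java.nio.charset.StandardCharsets.UTF_8 constant as the Charset instance; the class name Utf8DecodeExample and the method decode are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the lesson's code base.
class Utf8DecodeExample {

  // The UTF-8 encoding of "touché": the final character é (U+00E9) takes two bytes.
  static final byte[] TOUCHE_UTF8 = {0x74, 0x6F, 0x75, 0x63, 0x68, (byte) 0xC3, (byte) 0xA9};

  /** Decodes the given bytes as UTF-8 by reading them through an InputStreamReader. */
  static String decode(byte[] bytes) throws IOException {
    StringBuilder text = new StringBuilder();
    try (InputStream inputStream = new ByteArrayInputStream(bytes);
        Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8)) {
      int c;
      while ((c = reader.read()) != -1) { // each call returns one decoded char, not one byte
        text.append((char) c);
      }
    }
    return text.toString();
  }

  public static void main(String[] args) throws IOException {
    String text = decode(TOUCHE_UTF8);
    System.out.println(TOUCHE_UTF8.length + " bytes decoded to " + text.length() + " characters");
  }
}
```

Seven bytes go in and six characters come out, because the reader combines the last two bytes into the single character é.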
Byte Order Marks
Unfortunately, although Java's readers and writers do a fine job of converting between bytes and characters based upon charsets, they don't handle byte order marks at all. A BOM will be interpreted as bytes to be converted into characters, even though it should really be used to determine the charset and then be discarded. Instead you'll need to detect and write any BOM manually.
Reading BOMs
One approach for auto-detecting a charset is to read bytes from the underlying InputStream to see if they constitute a BOM before creating the InputStreamReader that wraps around it.
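The detection logic might be sketched as below. This is one possible approach, not a definitive implementation: it recognizes only the UTF-8, UTF-16BE, and UTF-16LE BOMs, ignores UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16LE), and falls back to UTF-8 when no BOM is found. The class BomDetectExample and its methods are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the lesson's code base.
class BomDetectExample {

  /** Creates a reader for the stream, choosing the charset from a leading BOM if present. */
  static Reader newReader(InputStream inputStream) throws IOException {
    // Mark/reset is needed to "unread" the bytes that were only inspected.
    InputStream in = inputStream.markSupported() ? inputStream : new BufferedInputStream(inputStream);
    byte[] start = new byte[3];
    in.mark(start.length);
    int count = 0;
    while (count < start.length) { // a single read() may return fewer bytes than requested
      int read = in.read(start, count, start.length - count);
      if (read == -1) {
        break;
      }
      count += read;
    }
    in.reset();

    Charset charset = StandardCharsets.UTF_8; // assumed default when there is no BOM
    int bomLength = 0;
    if (count >= 3 && (start[0] & 0xFF) == 0xEF && (start[1] & 0xFF) == 0xBB && (start[2] & 0xFF) == 0xBF) {
      bomLength = 3; // UTF-8 signature
    } else if (count >= 2 && (start[0] & 0xFF) == 0xFE && (start[1] & 0xFF) == 0xFF) {
      charset = StandardCharsets.UTF_16BE;
      bomLength = 2;
    } else if (count >= 2 && (start[0] & 0xFF) == 0xFF && (start[1] & 0xFF) == 0xFE) {
      charset = StandardCharsets.UTF_16LE;
      bomLength = 2;
    }
    for (int i = 0; i < bomLength; i++) {
      in.read(); // discard the BOM bytes so they are not decoded as a character
    }
    return new InputStreamReader(in, charset);
  }

  /** Reads all text from the stream using BOM detection. */
  static String readText(InputStream inputStream) throws IOException {
    StringBuilder text = new StringBuilder();
    try (Reader reader = newReader(inputStream)) {
      int c;
      while ((c = reader.read()) != -1) {
        text.append((char) c);
      }
    }
    return text.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
    System.out.println(readText(new ByteArrayInputStream(utf8WithBom)));
  }
}
```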
Writing BOMs
Similarly an OutputStreamWriter knows how to translate from characters to the correct byte sequence based on a charset, but it will not prepend any byte order mark to those bytes. To produce a BOM you must write its bytes to the underlying OutputStream yourself before wrapping the stream in a Writer.
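A minimal sketch of creating such a writer, assuming UTF-8 output, follows. The class BomWriterExample and the factory method newUtf8WriterWithBom are hypothetical names for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the lesson's code base.
class BomWriterExample {

  // The three-byte UTF-8 signature.
  private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

  /** Writes the UTF-8 BOM directly to the stream, then returns a writer that encodes as UTF-8. */
  static Writer newUtf8WriterWithBom(OutputStream outputStream) throws IOException {
    outputStream.write(UTF8_BOM); // raw bytes go to the stream before any characters are encoded
    return new OutputStreamWriter(outputStream, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (Writer writer = newUtf8WriterWithBom(bytes)) {
      writer.write("touch\u00E9");
    } // closing the writer flushes the encoded characters to the stream
    System.out.println(bytes.size() + " bytes written, including the BOM");
  }
}
```

The BOM must be written to the OutputStream itself, before the writer encodes anything, so that it ends up at the very start of the output.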
Buffered Readers and Writers
Analogous to BufferedInputStream and BufferedOutputStream, Java comes with java.io.BufferedReader and java.io.BufferedWriter classes for wrapping an existing reader or writer. These classes provide buffering at the character level rather than the byte level.
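As an illustration, a BufferedReader can decorate an InputStreamReader, which in turn adapts an InputStream. The sketch below also uses BufferedReader.readLine(), a convenience method not covered above; the class BufferedReaderExample and the method readLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not part of the lesson's code base.
class BufferedReaderExample {

  /** Reads all lines of UTF-8 text from the stream through a buffered reader. */
  static List<String> readLines(InputStream inputStream) throws IOException {
    List<String> lines = new ArrayList<>();
    // BufferedReader (decorator) wraps InputStreamReader, which wraps the byte stream.
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) { // null signals the end of the stream
        lines.add(line); // the line terminator is not included
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    byte[] bytes = "first\nsecond\r\nthird".getBytes(StandardCharsets.UTF_8);
    System.out.println(readLines(new ByteArrayInputStream(bytes)));
  }
}
```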
Review
Gotchas
- Using an InputStreamReader or OutputStreamWriter constructor that doesn't specify a charset will use the default system charset, which is unknown ahead of time and will cause your code to run differently across systems, likely causing data corruption.
- By default InputStreamReader will replace invalid byte sequences and characters rather than reporting an error. If you want to make sure the bytes are valid, you'll need to configure a CharsetDecoder appropriately using CodingErrorAction.REPORT to throw an exception for malformed input or unmappable characters.
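The second gotcha can be illustrated with a sketch that configures a decoder to report errors, using only the standard Charset.newDecoder() and CharsetDecoder API. The class StrictDecodeExample and the method decodeStrictly are hypothetical names, and the invalid input is a made-up byte sequence chosen because 0xFF never occurs in valid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the lesson's code base.
class StrictDecodeExample {

  /** Decodes bytes as UTF-8, throwing CharacterCodingException on invalid input. */
  static String decodeStrictly(byte[] bytes) throws IOException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT) // byte sequences that are not legal UTF-8
        .onUnmappableCharacter(CodingErrorAction.REPORT); // sequences with no character mapping
    StringBuilder text = new StringBuilder();
    try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
      int c;
      while ((c = reader.read()) != -1) {
        text.append((char) c);
      }
    }
    return text.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] invalid = {0x74, (byte) 0xFF, 0x6F}; // 0xFF is never valid in UTF-8
    try {
      decodeStrictly(invalid);
      System.out.println("decoded without error");
    } catch (CharacterCodingException e) {
      System.out.println("invalid input reported: " + e);
    }
  }
}
```

Because CharacterCodingException is a subclass of IOException, callers that already handle I/O errors will also see decoding errors unless they catch them separately.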
In the Real World
- You should allow for the possible presence of a BOM in all your code that converts an input stream of bytes to text for processing. If you never encounter a text document starting with a BOM, no harm will be done; but if you neglect to account for a BOM, you will break somebody's workflow down the road.
Think About It
- What is the difference in the range of possible values returned by InputStream.read() and Reader.read()?
Self Evaluation
- How are readers and writers different from input / output streams?
- How would you convert an input or output stream to a reader or a writer?
- How do the default readers and writers provided by Java handle byte order marks?
Task
Your existing IO.readText(InputStream) method fakes UTF-8 support by making sure all bytes in the input stream are values within the ASCII range. Upgrade the method to have true UTF-8 support using InputStreamReader.
- Make sure the input stream supports mark/reset. Remember that if it doesn't you can always wrap it in BufferedInputStream.
- Read the first few bytes, and if they are the UTF-8 signature, throw them away. Otherwise, reset the stream so that the actual content bytes aren't lost, and assume the content is encoded in UTF-8.
- (optional) If you like you can also check for the various UTF-16 and UTF-32 BOMs. Otherwise, be sure to document that the method only supports UTF-8.
- Create a CharsetDecoder for the charset and configure it to use CodingErrorAction.REPORT to report illegal byte sequences or invalid code points.
- Create an InputStreamReader with the appropriate charset using the CharsetDecoder.
- Read and return the text.
- Create unit tests with and without BOMs.
- Create unit tests with code points in the ASCII range.
- Create unit tests with code points in the ISO-8859-1 range, such as in the string "touché".
- Create unit tests with code points over 0xFF, such as the Hindi letter म (ma), which has code point U+092E.
You can use the online Unicode code converter to find the UTF-8 representations of Unicode code points.
Resources
- BabelMap Online (BabelStone)
- UniView (r12a)
- Unicode character pickers (r12a)
- Unicode code converter (r12a)
Acknowledgments
- Some symbols are from Font Awesome by Dave Gandy.