XML

Goals

Learn about text files as a data file format.
Understand the structure of an XML document.
Know how to process an XML document in Java using the DOM.
Be able to serialize an XML DOM tree to a file.

Concepts

attribute
binary file format
character data
character reference
content
delimiters
Document Object Model (DOM)
document type declaration
document type definition (DTD)
element
empty-element tag
end tag
entity reference
event-driven parser
Extensible Markup Language (XML)
HyperText Markup Language (HTML)
internal DTD subset
markup
namespace
namespace prefix
parser
parsing
public identifier
Scalable Vector Graphics (SVG)
schema
serialization
Simple API for XML (SAX)
start-tag
syntax
system identifier
tags
text file format
tree-based parser
verbose
vocabulary
well-formed
whitespace
Web Hypertext Application Technology Working Group (WHATWG)
World Wide Web Consortium (W3C)
XHTML
XML declaration
XML Schema

Library

Preview

Example XML storing address information.

<?xml version="1.0" encoding="UTF-8"?>
<address type="work" since="2016">
  <street>123 Main Street</street>
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address>

Lesson

You have now had first-hand experience of how tricky it can be to work directly with bytes in files. Even with Java's serialization classes, the resulting file format is inflexible. To a human, a file containing serialized classes appears to be little more than a mass of arbitrary byte values, unless you're trained in the specific format used by Java serialization.

Editing byte-based files by hand is even more difficult. The layout of the values cannot be changed; they rely on a specific order usually hard-coded into the program that reads and writes them. It is complicated to document these binary file formats and harder still to gain compliance across implementations.

For these reasons text file formats, which store information in human-readable files yet have a structure understandable by computers, have almost completely replaced binary formats for general configuration and sharing of small data sets. Their popularity stems from several benefits:

Text file formats can be edited by a normal text editor.
Their structure is flexible.
- Usually whitespace such as spaces, tabs, and newlines can be added without changing the meaning of the content.
- Sometimes even the order of items in the file can be altered without changing the meaning of the data.

Text file formats usually specify a syntax for how information is arranged in the file. Usually the syntax will use certain characters as delimiters to separate one piece of information from another. Recognizing the delimiters and extracting the relevant information based on the syntax is called parsing.
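To make the idea of parsing concrete, here is a minimal sketch in Java for a made-up text format (not XML) in which each line holds a name and a value separated by an equals sign. The format, the class name KeyValueParser, and the sample data are assumptions invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyValueParser {

  /**
   * Parses text in which each non-blank line has the hypothetical form "name=value".
   * The newline separates entries and the equals sign separates a name from its value.
   */
  public static Map<String, String> parse(final String text) {
    final Map<String, String> entries = new LinkedHashMap<>();
    for(final String line : text.split("\n")) {
      final String trimmedLine = line.trim();
      if(trimmedLine.isEmpty()) {
        continue; //extra blank lines do not change the meaning
      }
      final int delimiterIndex = trimmedLine.indexOf('=');
      if(delimiterIndex < 0) {
        throw new IllegalArgumentException("Missing '=' delimiter: " + trimmedLine);
      }
      final String name = trimmedLine.substring(0, delimiterIndex).trim();
      final String value = trimmedLine.substring(delimiterIndex + 1).trim();
      entries.put(name, value);
    }
    return entries;
  }

  public static void main(final String[] args) {
    System.out.println(parse("user = Jane Doe\n\ncity=Anytown\n"));
  }
}
```

Note how the whitespace around the delimiter and the blank line are ignored, which is the kind of flexibility text formats usually allow.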

XML

Historically the most widespread text file format, still in use and ubiquitously supported, is the Extensible Markup Language (XML), which is standardized and maintained by the World Wide Web Consortium (W3C). XML allows data to be separated into tags with identifier names. A parser will be able to extract the data and return a tree of nodes containing the data and the names given to each value.

Example XML file for storing information about a vehicle.

<car vin="123456789">
  <color>blue</color>
  <type>
    <make>Camaro</make>
    <model>Z28</model>
  </type>
</car>

XML comes from a tradition of markup languages oriented towards storing human language text rather than general data. Formats such as the HyperText Markup Language (HTML), a cousin of XML, use tags to mark up a portion of text which can be used by a computer for better understanding what was written as well as for displaying different text styles to the user. The following excerpt uses XHTML, a variation of HTML that is compliant with XML.

…
<p>To process <abbr title="Extensible Markup Language">XML</abbr> a computer must <dfn>parse</dfn>
the XML document, but still the computer may not know what the resulting tags <em>mean</em>
unless it is familiar with the XML vocabulary being used.</p>
…

Not every HTML document is an XML document; only those that follow the special rules of XML are.

XML Declaration

An XML document should begin with an XML declaration, which indicates the version of XML and optionally the charset of the document. A typical XML declaration looks like this:

Typical XML declaration.

<?xml version="1.0" encoding="UTF-8"?>

The values in an XML declaration may be surrounded by quotation mark " characters or single quote ' characters.

Document Type Declaration

After the XML declaration, an XML document may have a document type declaration identifying the schema prescribing the vocabulary and other constraints of some type of XML document. Originally such a grammar was defined in a separate document type definition (DTD), although nowadays other types of definition files may be used.

A document type declaration is placed between <!DOCTYPE and > delimiters. The name after the opening delimiter characters would indicate the name of the outer element, as explained below. Traditionally the doctype would then indicate the public identifier used to identify the DTD across all documents. The doctype would also indicate a system identifier indicating from where the parser could load the DTD if it wanted to validate the document. Most of the time a doctype declaration can be merely copied and pasted from the specification for the XML format you are using.

For example here is a doctype indicating that an XML document contains Scalable Vector Graphics (SVG) markup for SVG 1.1:

Document type declaration for SVG 1.1.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg">
  …
</svg>

Some XML documents might include an internal DTD subset, in which part of the document type definition is contained between bracket [ and ] characters within the XML document itself.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  …
]>
<foo>

</foo>

The latest version of HTML, named HTML5, provides for an XML representation (XHTML) but unlike previous versions does not provide a specific DTD for XHTML validation. Instead the simplified HTML5 doctype declaration is used to signal that the XML document contains HTML information.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  …
</html>

Comments

A comment in XML takes the form <!-- comment text --> and can span multiple lines. Comment text must not contain a sequence of two hyphen -- characters.

Elements

The primary XML structure for delimiting information is an element, which consists of a start-tag and an end-tag. A start-tag contains an identifier enclosed in angle bracket < and > characters, such as <foo>. The end-tag contains the same identifier, prefixed with a forward slash / character, such as </foo>. The element content between the tags may be character data (text that is not markup), other elements, or both.

Example XML elements for storing address information.

<?xml version="1.0" encoding="UTF-8"?>
<address> <!-- start-tag for element containing mixed content -->
  <street>123 Main Street</street> <!-- element containing character content -->
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address> <!-- end-tag -->

If an element has no content, it may be represented as an empty-element tag or self-closing tag in which a single tag includes a slash / character after the identifier, such as <foo/> (instead of <foo></foo>).

You may see empty elements being represented with a space before the slash character, especially for the HTML <br /> element. Early web browsers would get confused by a self-closing tag with no internal whitespace, and the practice has persisted though browsers have long since supported the more compact form. See XHTML1, Appendix C. HTML Compatibility Guidelines.

Attributes

Each element start-tag can optionally have several name-value pairs called attributes. Each attribute name is separated from its value by the equals = sign. The value itself must be enclosed either in paired quotation mark " characters, such as foo="bar", or in paired single quote ' characters, such as foo='bar'.

XML element for address information, with attributes.

<?xml version="1.0" encoding="UTF-8"?>
<address type="work" since="2016">
  <street>123 Main Street</street>
  <city>Anytown</city>
  <state>OK</state>
  <postalCode>00000</postalCode>
  <country>USA</country>
</address>

Character References

A character reference is simply a way to include XML character data by indicating a character's Unicode code point. A character reference begins with the characters &# and ends with a semicolon ; character. The Unicode code point is provided either as a decimal value or, if preceded by an x, as a hexadecimal value. When parsed, a character reference produces the character it represents. In other words, including &#x92E; is no different, once the document is parsed, than simply having entered the Hindi character म from the beginning.
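The equivalence between a character reference and the literal character can be checked with a short sketch. It uses the JDK's built-in DOM parser, which is covered later in this lesson; the class name CharacterReferenceDemo and the <letter> element are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CharacterReferenceDemo {

  /** Parses the given XML string and returns the text content of its root element. */
  public static String rootText(final String xml) {
    try {
      final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      final Document document = documentBuilder.parse(new InputSource(new StringReader(xml)));
      return document.getDocumentElement().getTextContent();
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    final String hexadecimal = rootText("<letter>&#x92E;</letter>"); //hexadecimal code point
    final String decimal = rootText("<letter>&#2350;</letter>"); //same code point in decimal
    final String literal = rootText("<letter>\u092E</letter>"); //the character itself
    System.out.println(hexadecimal.equals(literal) && decimal.equals(literal)); //prints "true"
  }
}
```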

Entity References

An entity reference is the preferred way to escape character content that would otherwise be considered markup, such as using &amp; to include a literal ampersand & character.

An entity reference looks similar to a character reference except that it is missing the number # sign. An entity reference stands for a character or a sequence of characters, identified by the entity reference name rather than a character's Unicode code point. For example the entity reference &lt; is equivalent to including the less than < character as XML character content.

Normally an entity must be declared in a DTD, indicating which character(s) it represents, before it can be referenced by name. However there are five predefined entities that are guaranteed to be recognized by every XML parser and thus may be used in any XML document, even if it has no DTD:

Predefined Entities

Entity	Value
`&amp;`	`&`
`&lt;`	`<`
`&gt;`	`>`
`&apos;`	`'`
`&quot;`	`"`

HTML defines many more entities, such as &eacute; to represent é, and these are available in every HTML document unless the document is parsed as XML, that is, as an XHTML document. If an HTML document is parsed as XML, these HTML entities will not be available unless they are defined as required by XML. Do not use special HTML entity references in your HTML documents, in case the content ever needs to be parsed as XML.

It is possible to define entities inside an internal DTD subset. The following defines the entity reference &lol; so that it will be replaced with the word haha followed by Unicode U+1F602, FACE WITH TEARS OF JOY.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY lol "haha 😂">
]>
<foo>
  …
</foo>
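As a rough check of how such a declaration behaves, the following sketch parses a document that declares &lol; in its internal DTD subset and also uses the predefined &amp; entity. It relies on the JDK's default DOM parser settings, under which entity references are expanded into text; the class name InternalEntityDemo is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class InternalEntityDemo {

  /** Parses the given XML string and returns the text content of its root element. */
  public static String rootText(final String xml) {
    try {
      final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      final Document document = documentBuilder.parse(new InputSource(new StringReader(xml)));
      return document.getDocumentElement().getTextContent();
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    //the Java escapes \uD83D\uDE02 together encode U+1F602, FACE WITH TEARS OF JOY
    final String xml = "<!DOCTYPE foo [\n"
        + "  <!ENTITY lol \"haha \uD83D\uDE02\">\n"
        + "]>\n"
        + "<foo>&lol; &amp; more</foo>";
    final String text = rootText(xml);
    //both the declared entity and the predefined entity have been replaced
    System.out.println(text.equals("haha \uD83D\uDE02 & more"));
  }
}
```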

XML Namespaces

The set of element and attribute names available to describe some data in XML is called a vocabulary, and at times it may be useful to use elements from multiple vocabularies in a single XML document. Because different vocabularies may use the same names for different meanings, XML provides a way to separate the vocabularies into different namespaces.

Each namespace is identified by a URI, a broader term for the URLs used when browsing the web. The easiest and most common approach for using namespaces is to set the default namespace for all the elements in the document by using the xmlns attribute on the root element. For example an XHTML document will indicate the default namespace http://www.w3.org/1999/xhtml on the root <html> element.

Declaring a default namespace in an XHTML5 document.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example XHTML Document</title></head>
  <body>
    …

A specific namespace can be explicitly assigned to certain elements and/or attributes by using a namespace prefix. First the prefix is defined using an attribute in the form xmlns:prefix="namespace-uri", where prefix is the prefix to define and namespace-uri is the URI identifying the namespace. Then to reference a namespace, use the prefix with the element or attribute name, separated by a colon : character.

The Maven POM, which you have been using since some of the earliest lessons, provides an example of XML namespaces. The Maven POM namespace itself is declared as the default using xmlns="http://maven.apache.org/POM/4.0.0". A separate namespace for XML Schema is associated with the xsi namespace prefix using xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance". Lastly the XML Schema namespace is used to specify the location of the schema for the POM vocabulary using xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd".

Declaring namespaces in a Maven POM.

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  …
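Once a document is parsed with a namespace-aware parser, the DOM (covered later in this lesson) reports the namespace of each element and lets you look up attributes by namespace rather than by prefix. The following is a minimal sketch using a shortened POM; the class name NamespaceDemo is made up for illustration. The factory is not namespace-aware unless you ask for it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NamespaceDemo {

  public static final String XSI_NAMESPACE = "http://www.w3.org/2001/XMLSchema-instance";

  /** A shortened Maven POM using the namespace declarations discussed above. */
  public static final String POM_XML = "<project xmlns=\"http://maven.apache.org/POM/4.0.0\"\n"
      + "    xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n"
      + "    xsi:schemaLocation=\"http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd\">\n"
      + "  <modelVersion>4.0.0</modelVersion>\n"
      + "</project>";

  /** Parses the XML string with a namespace-aware parser and returns the root element. */
  public static Element parseRoot(final String xml) {
    try {
      final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
      documentBuilderFactory.setNamespaceAware(true); //required for the namespace methods below
      return documentBuilderFactory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    final Element projectElement = parseRoot(POM_XML);
    System.out.println(projectElement.getNamespaceURI()); //the default namespace
    System.out.println(projectElement.getLocalName()); //the name without any prefix
    //look up the attribute by its namespace and local name, not by the xsi prefix
    System.out.println(projectElement.getAttributeNS(XSI_NAMESPACE, "schemaLocation"));
  }
}
```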

Parsing XML

DOM tree of an HTML document.

To read XML so that it can be processed you must use a parser. There are two broad types of parsers for XML. The Simple API for XML (SAX) is an event-driven parser; rather than loading the entire file in memory at once, each time the parser encounters elements, attributes, and other markup it will call methods in a handler you specify. Event-driven parsers are very memory efficient, but you are forced to process the file sequentially in the order markup occurs in the document.
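Although this lesson concentrates on the DOM, a brief sketch may help show what "event-driven" means in practice. The handler below simply records the name of each start-tag in the order the parser reports it; the class name SaxDemo is hypothetical, and the sample document is the vehicle example from earlier in the lesson.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {

  /** Returns the names of all elements in document order, collected from SAX events. */
  public static List<String> elementNames(final String xml) {
    final List<String> names = new ArrayList<>();
    final DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(final String uri, final String localName, final String qName,
          final Attributes attributes) {
        names.add(qName); //called by the parser each time it encounters a start-tag
      }
    };
    try {
      final SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
      saxParser.parse(new InputSource(new StringReader(xml)), handler);
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
    return names;
  }

  public static void main(final String[] args) {
    final String xml = "<car vin=\"123456789\">\n"
        + "  <color>blue</color>\n"
        + "  <type>\n"
        + "    <make>Camaro</make>\n"
        + "    <model>Z28</model>\n"
        + "  </type>\n"
        + "</car>";
    System.out.println(elementNames(xml)); //prints "[car, color, type, make, model]"
  }
}
```

No tree is ever built here; once the parser has moved past an element, the only record of it is whatever the handler chose to keep.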

A tree-based parser on the other hand reads the entire XML document and creates a tree data structure in memory representing elements, attributes, other markup, and text content. You can then access various portions of the tree in any order as needed, searching for relevant information. The trees returned by most Java XML parsers adhere to the Document Object Model (DOM), a set of interfaces standardized by the W3C for navigating the parsed nodes in an XML tree.

Here we will use a DOM-based parser for processing XML information. The parser is accessed using classes in the javax.xml.parsers package, and Java supports the DOM itself through the built-in org.w3c.dom package.

Getting a DOM Parser

The first step to parsing is retrieving an implementation of a DOM parser, referred to as javax.xml.parsers.DocumentBuilder, via a javax.xml.parsers.DocumentBuilderFactory. The factory retrieved by DocumentBuilderFactory.newInstance() can be configured to produce a validating parser and/or one that supports namespaces. Once the factory is appropriately configured, the parser may be retrieved using DocumentBuilderFactory.newDocumentBuilder().

Retrieving an XML DOM parser implementation.

final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(false); //optional; only if needed
documentBuilderFactory.setValidating(false); //optional; only if needed
final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

Parsing the XML Document

After getting a parser, you must tell it to parse your document. DocumentBuilder has methods for parsing from various sources, including a File, a String containing the URI of the document, and an org.xml.sax.InputSource (which can wrap a Reader), but the most common method is DocumentBuilder.parse(InputStream is), which parses an XML document from an InputStream. All the DocumentBuilder.parse(…) methods return an instance of org.w3c.dom.Document, which represents the DOM of your XML.

Parsing an XML document from a file using a DocumentBuilder.

…
final Document document;
final Path path = Paths.get("/etc/foo/bar.xml");
try(final InputStream inputStream = new BufferedInputStream(Files.newInputStream(path))) {
  document = documentBuilder.parse(inputStream);
}
//no need to keep the input stream open after parsing the document

You'll notice that the parsing methods discussed so far are specific to Java, living in the javax.xml.parsers package. The DOM Level 3 Load and Save Specification introduced a pure DOM approach for parsing XML using classes in the org.w3c.dom.ls package. First access an org.w3c.dom.bootstrap.DOMImplementationRegistry to retrieve an org.w3c.dom.ls.DOMImplementationLS, which works as a factory to create the actual org.w3c.dom.ls.LSParser. You will also need to create a special org.w3c.dom.ls.LSInput object to represent the input stream or reader.

final DOMImplementationRegistry domImplementationRegistry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS domImplementationLS = (DOMImplementationLS)domImplementationRegistry.getDOMImplementation("LS");
final LSParser lsParser = domImplementationLS.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
final Document document;
final Path path = Paths.get("/etc/foo/bar.xml");
try(final InputStream inputStream = new BufferedInputStream(Files.newInputStream(path))) {
  final LSInput lsInput = domImplementationLS.createLSInput();
  lsInput.setByteStream(inputStream);
  document = lsParser.parse(lsInput);
}
//no need to keep the input stream open after parsing the document

Using an LSParser is slightly more involved than using a DocumentBuilder but uses only DOM methods. Either approach should provide a Document instance that can be traversed and processed.

Traversing the DOM

Every node in the resulting tree produced by the XML parser is represented by an org.w3c.dom.Node. The specific type of node can be retrieved using Node.getNodeType(), which returns a short integer value indicating a node type; the value will be one of the constants defined by Node, such as Node.ELEMENT_NODE or Node.TEXT_NODE. The base Node type comes with many methods such as Node.getNodeName() that apply to most nodes, as well as methods such as Node.getAttributes() that apply only to certain node types. Although you could work with most nodes using the general Node methods, it is usually easier upon discovering the node type to cast the Node to the appropriate subtype such as org.w3c.dom.Element and access the specialized methods it provides, such as Element.getAttribute(String name).

According to the W3C DOM Levels 1, 2, and 3, the method Element.getAttribute(String name) returns an empty string even if the attribute value is missing! (You can find out if the attribute is actually present using Element.hasAttribute(String name).) The browser world, however, has been following improvements by the Web Hypertext Application Technology Working Group (WHATWG), which is the group now actively maintaining the DOM specification. The latest WHATWG versions of the DOM now say that Element.getAttribute(String name) should return null if the named attribute is not present, and this is what most browsers will return if their DOM is accessed via JavaScript. The W3C has followed suit in W3C DOM4. See What's the story with Element.getAttribute() for missing attributes?
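The following sketch illustrates the behavior of Java's built-in DOM for a missing attribute, along with a small helper that mimics the null-returning WHATWG behavior. The class name MissingAttributeDemo and the helper getAttributeOrNull() are hypothetical names invented here; they are not part of the DOM API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MissingAttributeDemo {

  /** Parses the given XML string and returns its root element. */
  public static Element parseRoot(final String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  /** Hypothetical helper: returns the attribute value, or null if the attribute is absent. */
  public static String getAttributeOrNull(final Element element, final String name) {
    return element.hasAttribute(name) ? element.getAttribute(name) : null;
  }

  public static void main(final String[] args) {
    final Element fruitElement = parseRoot("<fruit color=\"red\" note=\"\"/>");
    //a missing attribute and an empty attribute look the same through getAttribute()
    System.out.println(fruitElement.getAttribute("size").isEmpty()); //prints "true"
    System.out.println(fruitElement.getAttribute("note").isEmpty()); //prints "true"
    //hasAttribute() tells them apart
    System.out.println(fruitElement.hasAttribute("size")); //prints "false"
    System.out.println(fruitElement.hasAttribute("note")); //prints "true"
    System.out.println(getAttributeOrNull(fruitElement, "size")); //prints "null"
  }
}
```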

Text within the DOM requires special care. Each sequence of non-markup text within an element is represented by an org.w3c.dom.Text node (of type Node.TEXT_NODE). This means that even if an element only appears to contain other elements, the end-of-line characters and indentation whitespace will be stored as Text nodes! Consider the following simple XML document:

Example XML document to traverse using the DOM.

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>test</bar>
</foo>

Assuming the newline "\n" character was used to end each line, and that the tab "\t" character was used for indentation, when the document is parsed the <foo> element will contain three children:

Element: <foo>
- Text: "\n\t"
- Element: <bar>
  - Text: "test"
- Text: "\n"
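You can confirm this structure with a few lines of Java. This is a sketch using a non-validating parser with default settings; the class name WhitespaceNodesDemo is made up for illustration. A Text node reports the fixed node name #text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WhitespaceNodesDemo {

  /** Returns the node names of the direct children of the root element, in order. */
  public static List<String> childNodeNames(final String xml) {
    try {
      final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      final NodeList childNodes = document.getDocumentElement().getChildNodes();
      final List<String> names = new ArrayList<>();
      for(int i = 0; i < childNodes.getLength(); ++i) {
        names.add(childNodes.item(i).getNodeName());
      }
      return names;
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    //the whitespace around <bar> becomes two Text nodes
    System.out.println(childNodeNames("<foo>\n\t<bar>test</bar>\n</foo>")); //prints "[#text, bar, #text]"
    //with no whitespace between the tags, only the element remains
    System.out.println(childNodeNames("<foo><bar>test</bar></foo>")); //prints "[bar]"
  }
}
```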

The XML document itself is represented by an org.w3c.dom.Document node. The root element of the XML document structure, however, is retrieved by Document.getDocumentElement(). From there child nodes can be traversed using the org.w3c.dom.NodeList returned by Node.getChildNodes(). Unfortunately NodeList does not implement the standard java.util.List interface and thus must be iterated using NodeList.getLength() and NodeList.item(int index) as shown below.

Traversing a simple XML document using the DOM.

<?xml version="1.0" encoding="UTF-8"?>
<food>
  <vegetable color="green">lettuce</vegetable>
  <fruit color="red">apple</fruit>
  <vegetable color="red">tomato</vegetable>
</food>

…
final Element rootElement = document.getDocumentElement();
if(!rootElement.getNodeName().equals("food")) {
	throw new IOException("Unrecognized root element name.");
}
//print out the red food
final NodeList childNodes = rootElement.getChildNodes();
for(int i = 0; i < childNodes.getLength(); ++i) {
	final Node childNode = childNodes.item(i);
	if(childNode.getNodeType() == Node.ELEMENT_NODE) {
		final Element foodElement=(Element)childNode;
		if(foodElement.getAttribute("color").equals("red")) {
			final String foodType = foodElement.getNodeName();
			final String foodName = foodElement.getTextContent();
			System.out.println(String.format("%s (%s)", foodName, foodType));
		}
	}
}

Modifying the DOM

In addition to traversing the nodes of a DOM, you can also change the structure of the tree. The most useful methods are Element.setAttribute(String name, String value), which sets an attribute value for an element; and Node.appendChild(Node newChild), which adds a child node (such as an Element) to any Node (including an Element). The Document instance itself functions as a factory to produce new nodes, such as Document.createElement(String tagName) to create an Element node.

To add text content to an element, you must use Node.appendChild(Node newChild) to append a child node of type Text containing the correct text content. A Text node is created using Document.createTextNode(String data).
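These methods can be combined as in the following sketch, which adds a new <fruit> element with a color attribute and text content to a parsed document. The class name ModifyDomDemo and the method addFruit() are hypothetical names chosen for illustration; the vocabulary is borrowed from the food example above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ModifyDomDemo {

  /** Parses the XML string and appends a new fruit element to the root element. */
  public static Document addFruit(final String xml, final String name, final String color) {
    final Document document;
    try {
      document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch(final ParserConfigurationException | SAXException | IOException exception) {
      throw new IllegalStateException(exception);
    }
    final Element fruitElement = document.createElement("fruit"); //the document is the node factory
    fruitElement.setAttribute("color", color);
    fruitElement.appendChild(document.createTextNode(name)); //text content is a child Text node
    document.getDocumentElement().appendChild(fruitElement); //added as the last child of the root
    return document;
  }

  public static void main(final String[] args) {
    final Document document = addFruit("<food><vegetable color=\"green\">lettuce</vegetable></food>",
        "apple", "red");
    final Element addedElement = (Element)document.getDocumentElement().getLastChild();
    System.out.println(String.format("%s %s %s", addedElement.getNodeName(),
        addedElement.getAttribute("color"), addedElement.getTextContent())); //prints "fruit red apple"
  }
}
```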

Creating a DOM Instance

Not only can you change an existing DOM tree, you can create an entire DOM instance from scratch in memory. An org.w3c.dom.DOMImplementation, which represents the specific XML implementation configured in your JVM, acts as a factory to create new documents. You can retrieve an instance of DOMImplementation from the DocumentBuilder you retrieved above by using DocumentBuilder.getDOMImplementation().

Once you have a DOMImplementation, calling DOMImplementation.createDocument(String namespaceURI, String qualifiedName, DocumentType doctype) will create a new document, which you can then modify as described above. The qualified name indicates the name to use for the root element, which you can then retrieve using Document.getDocumentElement() as you would normally do when traversing the tree. You may provide null for both the namespace URI and the doctype if you want to create a simple document without namespaces or a doctype.

Creating a simple XML document <foo>bar</foo> using the DOM.

final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final DOMImplementation domImplementation = documentBuilder.getDOMImplementation();
final Document fooDocument = domImplementation.createDocument(null, "foo", null);  //<foo></foo>
final Text barText = fooDocument.createTextNode("bar");  //"bar"
fooDocument.getDocumentElement().appendChild(barText);  //<foo>bar</foo>

Generating XML

As with generating byte representations of Java objects, producing a text or byte representation of an XML DOM instance is referred to as serialization.

XML Serialization using `Transformer`

Java provides XML serialization capabilities in the javax.xml.transform package. Use a javax.xml.transform.TransformerFactory to retrieve a javax.xml.transform.Transformer. Use Transformer.setOutputProperty(String name, String value) as needed to configure the transformer, using javax.xml.transform.OutputKeys values such as OutputKeys.ENCODING. Finally use Transformer.transform(Source xmlSource, Result outputTarget) to serialize the XML represented by a javax.xml.transform.dom.DOMSource to a javax.xml.transform.stream.StreamResult.

Serializing a DOM instance using a Transformer.

final TransformerFactory transformerFactory = TransformerFactory.newInstance();
final Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.UTF_8.name());
//optional: transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
//optional: transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//optional: transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
final StreamResult streamResult = new StreamResult(new StringWriter());  //TODO write to actual file
final DOMSource domSource = new DOMSource(document);
transformer.transform(domSource, streamResult);
System.out.println(streamResult.getWriter().toString());

A StreamResult can be constructed to serialize to other destinations, such as an existing OutputStream.
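For instance, serializing to a file might look something like the following sketch, which wraps the OutputStream of a file in a StreamResult. The class name TransformerFileDemo and its methods are hypothetical, and a temporary file stands in for a real destination.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class TransformerFileDemo {

  /** Serializes the document to the given file using the UTF-8 encoding. */
  public static void write(final Document document, final Path path) throws IOException, TransformerException {
    final Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.UTF_8.name());
    try(final OutputStream outputStream = new BufferedOutputStream(Files.newOutputStream(path))) {
      transformer.transform(new DOMSource(document), new StreamResult(outputStream));
    }
  }

  /** Builds <foo>bar</foo>, writes it to a temporary file, and returns the file's contents. */
  public static String writeAndReadBack() {
    try {
      final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .getDOMImplementation().createDocument(null, "foo", null);
      document.getDocumentElement().appendChild(document.createTextNode("bar"));
      final Path path = Files.createTempFile("foo", ".xml");
      try {
        write(document, path);
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
      } finally {
        Files.delete(path); //clean up the temporary file
      }
    } catch(final IOException | ParserConfigurationException | TransformerException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    System.out.println(writeAndReadBack()); //an XML declaration followed by <foo>bar</foo>
  }
}
```

Passing an OutputStream rather than a Writer lets the transformer apply the configured encoding itself, so the bytes in the file match the encoding named in the XML declaration.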

XML Serialization using `LSSerializer`

As noted above for parsing, the DOM Level 3 Load and Save Specification provides a pure DOM approach for serializing XML. First access an org.w3c.dom.bootstrap.DOMImplementationRegistry to retrieve an org.w3c.dom.ls.DOMImplementationLS, which works as a factory to create the actual org.w3c.dom.ls.LSSerializer. You will also need to create a special org.w3c.dom.ls.LSOutput object to represent the output stream or writer.

final DOMImplementationRegistry domImplementationRegistry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS domImplementationLS = (DOMImplementationLS)domImplementationRegistry.getDOMImplementation("LS");
final LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
//optional: lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
final LSOutput lsOutput = domImplementationLS.createLSOutput();
lsOutput.setCharacterStream(new StringWriter());  //TODO write to actual file
lsOutput.setEncoding(StandardCharsets.UTF_8.name());
lsSerializer.write(document, lsOutput);
System.out.println(lsOutput.getCharacterStream().toString());

In addition to LSOutput.setCharacterStream(Writer characterStream) for serializing to a Writer, there is also an LSOutput.setByteStream(OutputStream byteStream) method for serializing to an OutputStream.
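A sketch of the byte stream variant follows. It serializes to an in-memory ByteArrayOutputStream so that the result can be inspected; for a real file you could substitute the stream returned by Files.newOutputStream(Path). The class name LSByteStreamDemo and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class LSByteStreamDemo {

  /** Serializes the document to UTF-8 bytes using an LSSerializer and returns them as a string. */
  public static String serialize(final Document document) {
    try {
      final DOMImplementationRegistry domImplementationRegistry = DOMImplementationRegistry.newInstance();
      final DOMImplementationLS domImplementationLS =
          (DOMImplementationLS)domImplementationRegistry.getDOMImplementation("LS");
      final LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
      final ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
      final LSOutput lsOutput = domImplementationLS.createLSOutput();
      lsOutput.setByteStream(byteArrayOutputStream); //bytes rather than characters
      lsOutput.setEncoding(StandardCharsets.UTF_8.name());
      lsSerializer.write(document, lsOutput);
      return new String(byteArrayOutputStream.toByteArray(), StandardCharsets.UTF_8);
    } catch(final ReflectiveOperationException exception) {
      throw new IllegalStateException(exception);
    }
  }

  /** Creates the simple document <foo>bar</foo> in memory. */
  public static Document newFooDocument() {
    try {
      final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .getDOMImplementation().createDocument(null, "foo", null);
      document.getDocumentElement().appendChild(document.createTextNode("bar"));
      return document;
    } catch(final ParserConfigurationException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    System.out.println(serialize(newFooDocument()));
  }
}
```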

An LSSerializer seems to produce more natural, formatted output with less configuration than a Transformer. Some report it may be faster as well.

Review

Gotchas

Don't confuse the XML terms valid and well-formed. If a document follows the rules of XML it is well-formed, but is considered valid only if it has been checked against a DTD or some other schema.
HTML predefined entities such as &eacute; are not available if the document is parsed as XML, unless the entities are declared as required by XML. With Unicode support in modern text editors, there is little need to use these entities anyway; just enter the literal character you desire.
Don't forget that even if an element appears to only have other elements as child nodes, there may appear text nodes in the DOM representing whitespace such as line endings.
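The first gotcha can be made concrete with a small sketch. A non-validating parser, which is what DocumentBuilderFactory produces by default, checks only that a document is well-formed; it reports a violation by throwing an org.xml.sax.SAXException. The class name WellFormedDemo and the method isWellFormed() are hypothetical. Checking validity would additionally require a DTD or other schema and a validating parser, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormedDemo {

  /** Returns whether the given string can be parsed as a well-formed XML document. */
  public static boolean isWellFormed(final String xml) {
    try {
      final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      documentBuilder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch(final SAXException saxException) {
      return false; //the parser rejected the markup
    } catch(final ParserConfigurationException | IOException exception) {
      throw new IllegalStateException(exception);
    }
  }

  public static void main(final String[] args) {
    System.out.println(isWellFormed("<foo><bar>test</bar></foo>")); //prints "true"
    System.out.println(isWellFormed("<foo><bar>test</foo>")); //prints "false": <bar> has no end-tag
  }
}
```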

In the Real World

XML has become ubiquitous within the world of computers; there are few languages and platforms that do not have robust XML parsers and other tools available. Recently its use has been supplanted somewhat by newer, simpler formats such as JSON, which you will learn about in a future lesson.

Think About It

Attributes and child elements are two different ways to represent subordinate information in an XML document. Beyond small details (such as the fact that attributes cannot in turn have children), the semantic choice between the two is largely arbitrary when designing an XML-based format.

Self Evaluation

What are some of the benefits of text file formats over binary ones?
What is the syntax of a text format, and how does that differ from its semantics?
What is the difference between well-formed and valid XML documents?
How would you create an XML element that contains a child element but generates no child text nodes in the parsed DOM tree?

Task

Upgrade the configuration file used to store the Booker application user name to XML format, using the following template:

<config>
  <user>Jane Doe</user>
</config>

Use a file named config.xml in the .booker subdirectory of the user's home directory, replacing the user.txt file you were using in previous lessons.
When the Booker application runs, check to see if there is a config.xml file in the configuration directory.
- If there is no such file, generate one containing the current system user account name. Create a DOM tree manually in memory and then serialize it. You can retrieve the name of the current system logged-in user using System.getProperty(String key) with the key "user.name".
- If that file exists, open and parse it to retrieve the name of the user.
Print the user's name to the console (e.g. Library of Jane Doe) before printing the list of books.

References

Resources

Acknowledgments

DOM tree diagram modified from DOM model by Birger Eriksson (Own work) [CC BY-SA 3.0], via Wikimedia Commons.
Some symbols are from Font Awesome by Dave Gandy.