XML
Goals
- Learn about text files as a data file format.
- Understand the structure of an XML document.
- Know how to process an XML document in Java using the DOM.
- Be able to serialize an XML DOM tree to a file.
Concepts
- attribute
- binary file format
- character data
- character reference
- content
- delimiters
- Document Object Model (DOM)
- document type declaration
- document type definition (DTD)
- element
- empty-element tag
- end-tag
- entity reference
- event-driven parser
- Extensible Markup Language (XML)
- HyperText Markup Language (HTML)
- internal DTD subset
- markup
- namespace
- namespace prefix
- parser
- parsing
- public identifier
- Scalable Vector Graphics (SVG)
- schema
- serialization
- Simple API for XML (SAX)
- start-tag
- syntax
- system identifier
- tags
- text file format
- tree-based parser
- verbose
- vocabulary
- well-formed
- whitespace
- Web Hypertext Application Technology Working Group (WHATWG)
- World Wide Web Consortium (W3C)
- XHTML
- XML declaration
- XML Schema
Library
java.lang.System.getProperty(String key)
javax.xml.parsers
javax.xml.parsers.DocumentBuilder
javax.xml.parsers.DocumentBuilder.getDOMImplementation()
javax.xml.parsers.DocumentBuilder.parse(InputStream is)
javax.xml.parsers.DocumentBuilderFactory
javax.xml.parsers.DocumentBuilderFactory.newDocumentBuilder()
javax.xml.parsers.DocumentBuilderFactory.newInstance()
javax.xml.transform
javax.xml.transform.OutputKeys
javax.xml.transform.OutputKeys.ENCODING
javax.xml.transform.Transformer
javax.xml.transform.Transformer.setOutputProperty(String name, String value)
javax.xml.transform.Transformer.transform(Source xmlSource, Result outputTarget)
javax.xml.transform.TransformerFactory
javax.xml.transform.dom.DOMSource
javax.xml.transform.stream.StreamResult
org.xml.sax.SAXException
org.w3c.dom
org.w3c.dom.Document
org.w3c.dom.Document.createElement(String tagName)
org.w3c.dom.Document.getDocumentElement()
org.w3c.dom.Document.createTextNode(String data)
org.w3c.dom.DOMImplementation
org.w3c.dom.DOMImplementation.createDocument(String namespaceURI, String qualifiedName, DocumentType doctype)
org.w3c.dom.Element
org.w3c.dom.Element.getAttribute(String name)
org.w3c.dom.Element.hasAttribute(String name)
org.w3c.dom.Element.setAttribute(String name, String value)
org.w3c.dom.Node
org.w3c.dom.Node.ELEMENT_NODE
org.w3c.dom.Node.TEXT_NODE
org.w3c.dom.Node.appendChild(Node newChild)
org.w3c.dom.Node.getAttributes()
org.w3c.dom.Node.getChildNodes()
org.w3c.dom.Node.getNodeName()
org.w3c.dom.Node.getNodeType()
org.w3c.dom.Node.getTextContent()
org.w3c.dom.NodeList
org.w3c.dom.NodeList.getLength()
org.w3c.dom.NodeList.item(int index)
org.w3c.dom.Text
org.w3c.dom.ls
org.w3c.dom.ls.DOMImplementationLS
org.w3c.dom.ls.LSInput
org.w3c.dom.ls.LSOutput
org.w3c.dom.ls.LSOutput.setByteStream(OutputStream byteStream)
org.w3c.dom.ls.LSOutput.setCharacterStream(Writer characterStream)
org.w3c.dom.ls.LSParser
org.w3c.dom.ls.LSSerializer
Preview
Lesson
You've now had first-hand experience of how tricky it can be to work directly with bytes in files. Even with Java's serialization classes, the resulting file format is inflexible. To a human, a file containing serialized classes appears to be little more than a mass of arbitrary byte values, unless you're trained in the specific format used by Java serialization.
Editing byte-based files by hand is even more difficult. The layout of the values cannot be changed; they rely on a specific order usually hard-coded into the program that reads and writes them. It is complicated to document these binary file formats and harder still to gain compliance across implementations.
For these reasons text file formats, which store information in human-readable files yet have a structure understandable by computers, have almost completely replaced binary formats for general configuration and sharing of small data sets. Their popularity stems from several benefits:
- Text file formats can be edited by a normal text editor.
- Their structure is flexible.
- Usually whitespace such as spaces, tabs, and newlines can be added without changing the meaning of the content.
- Sometimes even the order of items in the file can be altered without changing the meaning of the data.
Text file formats usually specify a syntax for how information is arranged in the file. Usually the syntax will use certain characters as delimiters to separate one piece of information from another. Recognizing the delimiters and extracting the relevant information based on the syntax is called parsing.
XML
Historically the most widespread text file format, still in use and ubiquitously supported, is the Extensible Markup Language (XML), which is standardized and maintained by the World Wide Web Consortium (W3C). XML allows data to be separated into tags with identifier names. A parser will be able to extract the data and return a tree of nodes containing the data and the names given to each value.
XML Declaration
An XML document should begin with an XML declaration, which indicates the version of XML and optionally the character encoding of the document. A typical XML declaration looks like this:
<?xml version="1.0" encoding="UTF-8"?>
Document Type Declaration
After the XML declaration, an XML document may have a document type declaration identifying the schema prescribing the vocabulary and other constraints of some type of XML document. Originally such a grammar was defined in a separate document type definition (DTD), although nowadays other types of definition files may be used.
A document type declaration is placed between <!DOCTYPE and > delimiters. The name after the opening delimiter characters indicates the name of the outer element, as explained below. Traditionally the doctype would then indicate the public identifier used to identify the DTD across all documents. The doctype would also indicate a system identifier indicating from where the parser could load the DTD if it wanted to validate the document. Most of the time a doctype declaration can be merely copied and pasted from the specification for the XML format you are using.
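For example, here is a doctype indicating that an XML document contains Scalable Vector Graphics (SVG) markup for SVG 1.1:
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">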
Comments
A comment in XML takes the form <!-- … --> and can span multiple lines. Comment text must not contain a sequence of two hyphen -- characters.
Elements
The primary XML structure for delimiting information is an element, which consists of a start-tag and an end-tag. A start-tag contains an identifier enclosed in angle bracket < and > characters, such as <foo>. The end-tag contains the same identifier, prefixed with a forward slash / character, such as </foo>. The element content between the tags may be character data (text that is not markup), other elements, or both.
Attributes
Each element start-tag can optionally have several name-value pairs called attributes. Each attribute name is separated from its value by the equals = sign. The value itself must be enclosed either in paired quotation mark " characters, such as foo="bar", or in paired single quote ' characters, such as foo='bar'.
Character References
A character reference is simply a way to include XML character data by indicating a character's Unicode code point. A character reference begins with the characters &# and ends with a semicolon ; character. The Unicode code point is provided either as a decimal value or, if preceded by an x, as a hexadecimal value. When parsed, a character reference produces the character it represents. In other words, including &#x92E; is no different, once the document is parsed, than simply having entered the Hindi character म from the beginning.
Entity References
An entity reference looks similar to a character reference except that it is missing the number # sign. An entity reference stands for a character or a sequence of characters, identified by the entity reference name rather than a character's Unicode code point. For example the entity reference &lt; is equivalent to including the less than < character as XML character content.
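The effect of both kinds of references can be checked with a small experiment. The following is a minimal sketch, assuming the JDK's built-in DOM parser (covered later in this lesson); the class name ReferenceDemo and the sample <p> element are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ReferenceDemo {

  /** Parses the given XML and returns the text content of its root element. */
  public static String rootText(String xml) throws Exception {
    final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    return document.getDocumentElement().getTextContent();
  }

  public static void main(String[] args) throws Exception {
    // One entity reference, plus hexadecimal and decimal character
    // references for the same code point, U+092E.
    final String text = rootText("<p>1 &lt; 2, &#x92E; = &#2350;</p>");
    // After parsing, the references are indistinguishable from literal characters.
    System.out.println(text.equals("1 < 2, \u092E = \u092E")); // prints "true"
  }
}
```

Once parsed, the DOM holds only the resulting characters; it does not record whether a character was written literally or as a reference.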
Normally an entity must be declared in a DTD, indicating which character(s) it represents, before it can be referenced by name. However there are five predefined entities that are guaranteed to be recognized by every XML parser and thus may be used in any XML document, even if it has no DTD:
Entity | Value |
---|---|
&amp; | & |
&lt; | < |
&gt; | > |
&apos; | ' |
&quot; | " |
XML Namespaces
The element and attribute names available to describe some data in XML are called a vocabulary, and at times it may be useful to use elements from multiple vocabularies in a single XML document. Because different vocabularies may use the same names for different meanings, XML provides a way to separate the vocabularies into different namespaces.
Each namespace is identified by a URI, a broader term for the URLs used when browsing the web. The easiest and most common approach for using namespaces is to set the default namespace for all the elements in the document by using the xmlns attribute on the root element. For example an XHTML document will indicate the default namespace http://www.w3.org/1999/xhtml on the root <html> element.
A specific namespace can be explicitly assigned to certain elements and/or attributes by using a namespace prefix. First the prefix is defined using an attribute in the form xmlns:prefix="namespace-uri", where prefix is the prefix to define and namespace-uri is the URI identifying the namespace. Then to reference a namespace, use the prefix with the element or attribute name, separated by a colon : character.
The Maven POM, which you have been using since some of the earliest lessons, provides an example of XML namespaces. The Maven POM namespace itself is declared as the default using xmlns="http://maven.apache.org/POM/4.0.0". A separate namespace for XML Schema instances is associated with the xsi namespace prefix using xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance". Lastly the xsi prefix is used to specify the location of the schema for the POM vocabulary using xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd".
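To see how a parser reports these namespaces, the sketch below parses a cut-down POM root element with a namespace-aware DOM parser. It assumes the JDK's built-in parser; the class name NamespaceDemo and its constants are invented for this illustration, and the sample document contains only the root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

public class NamespaceDemo {

  public static final String POM_NS = "http://maven.apache.org/POM/4.0.0";
  public static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

  /** A POM root element reduced to its namespace declarations. */
  public static final String SAMPLE = "<project xmlns=\"" + POM_NS + "\""
      + " xmlns:xsi=\"" + XSI_NS + "\""
      + " xsi:schemaLocation=\"" + POM_NS + " http://maven.apache.org/xsd/maven-4.0.0.xsd\"/>";

  /** Parses the XML with namespace support turned on and returns the root element. */
  public static Element parseRoot(String xml) throws Exception {
    final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // namespace support is off by default
    return factory.newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
        .getDocumentElement();
  }

  public static void main(String[] args) throws Exception {
    final Element root = parseRoot(SAMPLE);
    // The unprefixed element picks up the default namespace.
    System.out.println(root.getLocalName() + " in " + root.getNamespaceURI());
    // The prefixed attribute is looked up by namespace URI, not by prefix.
    System.out.println(root.getAttributeNS(XSI_NS, "schemaLocation"));
  }
}
```

Looking attributes up by namespace URI rather than by prefix means the code keeps working even if a document author chooses a prefix other than xsi.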
Parsing XML
To read XML so that it can be processed you must use a parser. There are two broad types of parsers for XML. The Simple API for XML (SAX) is an event-driven parser; rather than loading the entire file in memory at once, each time the parser encounters elements, attributes, and other markup this type of parser will call methods in a handler you specify. Event-driven parsers are very memory efficient, but you are forced to process the file sequentially in the order markup occurs in the document.
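For comparison with the DOM approach used in the rest of this lesson, here is a minimal sketch of event-driven parsing. It assumes the JDK's javax.xml.parsers.SAXParserFactory and org.xml.sax.helpers.DefaultHandler; the class name SaxElementNames is made up, and the handler does nothing more than record the name of each start-tag as it is encountered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxElementNames {

  /** Returns the names of all elements in document order, collected from SAX events. */
  public static List<String> elementNames(String xml) throws Exception {
    final List<String> names = new ArrayList<>();
    final DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attributes) {
        names.add(qName); // called by the parser once for every start-tag
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(elementNames("<config><user>Jane Doe</user></config>")); // prints "[config, user]"
  }
}
```

No tree is ever built here; anything the program needs later must be captured by the handler while the events go by.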
A tree-based parser on the other hand reads the entire XML document and creates a tree data structure in memory representing elements, attributes, other markup, and text content. You can then access various portions of the tree in any order as needed, searching for relevant information. The trees returned by most Java XML parsers adhere to the Document Object Model (DOM), a set of interfaces standardized by the W3C for navigating the parsed nodes in an XML tree.
Here we will use a DOM-based parser for processing XML information, which is accessed using classes in the javax.xml.parsers package. Java supports the DOM through the built-in org.w3c.dom package.
Getting a DOM Parser
The first step to parsing is retrieving an implementation of a DOM parser, represented by javax.xml.parsers.DocumentBuilder, via a javax.xml.parsers.DocumentBuilderFactory. The factory retrieved by DocumentBuilderFactory.newInstance() can be configured to produce a validating parser and/or one that supports namespaces. Once the factory is appropriately configured, the parser may be retrieved using DocumentBuilderFactory.newDocumentBuilder().
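These steps might look like the following minimal sketch. The class and method names are invented for illustration; the factory settings shown are only one possible configuration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class DomParserSetup {

  /** Creates a namespace-aware, non-validating DOM parser. */
  public static DocumentBuilder newNamespaceAwareBuilder() throws ParserConfigurationException {
    final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // off by default
    factory.setValidating(false); // already the default; shown for illustration
    return factory.newDocumentBuilder();
  }

  public static void main(String[] args) throws ParserConfigurationException {
    final DocumentBuilder builder = newNamespaceAwareBuilder();
    System.out.println("namespace aware: " + builder.isNamespaceAware());
  }
}
```

The factory must be configured before newDocumentBuilder() is called; later changes to the factory do not affect parsers it has already produced.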
Parsing the XML Document
After getting a parser, you must tell it to parse your document. DocumentBuilder has methods for parsing from various sources, including a File, a URI given as a String, and an org.xml.sax.InputSource (which can wrap a Reader), but the most common method is DocumentBuilder.parse(InputStream is), which parses an XML document from an InputStream. All the DocumentBuilder.parse(…) methods return an instance of org.w3c.dom.Document, which represents the DOM of your XML.
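As a minimal sketch, parsing from an InputStream could look like this. The class name ParseExample is invented, and an in-memory byte array stands in for a real file; with a file you would open a stream using something like java.nio.file.Files.newInputStream(…) instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class ParseExample {

  /** Parses an XML document from the given stream using a default DOM parser. */
  public static Document parse(InputStream inputStream)
      throws ParserConfigurationException, SAXException, IOException {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(inputStream);
  }

  public static void main(String[] args) throws Exception {
    final String xml = "<config><user>Jane Doe</user></config>";
    // try-with-resources closes the stream when parsing is finished
    try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
      final Document document = parse(in);
      System.out.println(document.getDocumentElement().getNodeName()); // prints "config"
    }
  }
}
```

A document that is not well-formed causes parse(…) to throw an org.xml.sax.SAXException, so callers should be prepared to handle it.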
Traversing the DOM
Every node in the resulting tree produced by the XML parser is represented by an org.w3c.dom.Node. The specific type of node can be retrieved using Node.getNodeType(), which returns a short integer value indicating a node type; the value will be one of the constants defined by Node, such as Node.ELEMENT_NODE or Node.TEXT_NODE. The base Node type comes with many methods such as Node.getNodeName() that apply to most nodes, as well as methods such as Node.getAttributes() that apply only to certain node types. Although you could work with most nodes using the general Node methods, it is usually easier upon discovering the node type to cast the Node to the appropriate subtype such as org.w3c.dom.Element and access the specialized methods it provides, such as Element.getAttribute(String name).
Text within the DOM requires special care. Each sequence of non-markup text within an element is represented by an org.w3c.dom.Text node (of type Node.TEXT_NODE). This means that even if an element only appears to contain other elements, the end-of-line characters and indentation whitespace will be stored as Text nodes! Consider the following simple XML document:
<foo>
	<bar>test</bar>
</foo>
Assuming the newline "\n" character was used to end each line, and that the tab "\t" character was used for indentation, when the document is parsed the <foo> element will contain three children:
- Element: <foo>
  - Text: "\n\t"
  - Element: <bar>
    - Text: "test"
  - Text: "\n"
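The three children can be confirmed with a short experiment. This is a sketch only, assuming the JDK's default non-validating parser and a document laid out with "\n" line endings and "\t" indentation as described; the class name WhitespaceNodes is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class WhitespaceNodes {

  /** Parses the XML and returns the direct children of its root element. */
  public static NodeList rootChildren(String xml) throws Exception {
    final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    return document.getDocumentElement().getChildNodes();
  }

  public static void main(String[] args) throws Exception {
    final NodeList children = rootChildren("<foo>\n\t<bar>test</bar>\n</foo>");
    System.out.println(children.getLength()); // prints "3"
    for (int i = 0; i < children.getLength(); i++) {
      final Node child = children.item(i);
      // Whitespace between the tags shows up as Text nodes around the element.
      System.out.println(child.getNodeType() == Node.TEXT_NODE ? "Text" : "Element");
    }
  }
}
```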
The XML document itself is represented by an org.w3c.dom.Document node. The root element of the XML document structure, however, is retrieved by Document.getDocumentElement(). From there child nodes can be traversed using the org.w3c.dom.NodeList returned by Node.getChildNodes(). Unfortunately NodeList does not implement the standard java.util.List interface and thus must be iterated by index using NodeList.getLength() and NodeList.item(int index).
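One way to iterate a NodeList by index is sketched here. The helper childElements(…) and the class name are hypothetical, as is the id attribute in the sample document; the helper copies only the element children into an ordinary java.util.List, skipping whitespace text nodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ChildElements {

  /** Returns the children of the given node that are elements, skipping text and other nodes. */
  public static List<Element> childElements(Node parent) {
    final List<Element> elements = new ArrayList<>();
    final NodeList childNodes = parent.getChildNodes();
    for (int i = 0; i < childNodes.getLength(); i++) {
      final Node child = childNodes.item(i);
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        elements.add((Element) child); // safe cast: the node type was just checked
      }
    }
    return elements;
  }

  /** Parses XML held in a string; a convenience for this example only. */
  public static Document parse(String xml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    final Document document = parse("<config>\n\t<user id=\"jdoe\">Jane Doe</user>\n</config>");
    for (final Element element : childElements(document.getDocumentElement())) {
      System.out.println(element.getNodeName() + " " + element.getAttribute("id")
          + ": " + element.getTextContent()); // prints "user jdoe: Jane Doe"
    }
  }
}
```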
Modifying the DOM
In addition to traversing the nodes of a DOM, you can also change the structure of the tree. The most useful methods are Element.setAttribute(String name, String value), which sets an attribute value for an element; and Node.appendChild(Node newChild), which adds a child node (such as an Element) to any Node (including an Element). The Document instance itself functions as a factory to produce new nodes, such as Document.createElement(String tagName) to create an Element node.
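The following sketch puts these methods together, adding a <user> element with text content to an existing <config> root. The class name, the addUser(…) helper, and the version attribute are all invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ModifyDom {

  /** Appends a new <user> element containing the given name to the document's root element. */
  public static Element addUser(Document document, String name) {
    final Element user = document.createElement("user"); // the document is the node factory
    user.appendChild(document.createTextNode(name));
    document.getDocumentElement().appendChild(user);
    return user;
  }

  public static void main(String[] args) throws Exception {
    final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream("<config/>".getBytes(StandardCharsets.UTF_8)));
    document.getDocumentElement().setAttribute("version", "1"); // hypothetical attribute
    addUser(document, "Jane Doe");
    System.out.println(document.getDocumentElement().getTextContent()); // prints "Jane Doe"
  }
}
```

Note that a newly created node is not part of the tree until it is appended; createElement(…) alone changes nothing in the document.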
Creating a DOM Instance
Not only can you change an existing DOM tree, you can create an entire DOM instance from scratch in memory. An org.w3c.dom.DOMImplementation, which represents the specific XML implementation configured in your JVM, acts as a factory to create new documents. You can retrieve an instance of DOMImplementation from the DocumentBuilder you retrieved above by using DocumentBuilder.getDOMImplementation().
Once you have a DOMImplementation, calling DOMImplementation.createDocument(String namespaceURI, String qualifiedName, DocumentType doctype) will create a new document, which you can then modify as described above. The qualified name indicates the name to use for the root element, which you can then retrieve using Document.getDocumentElement() as you would normally do when traversing the tree. You may provide null for both the namespace URI and the doctype if you want to create a simple document without namespaces or a doctype.
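A minimal sketch of creating an empty document this way follows; the class and method names are made up, and the root element name is passed in by the caller.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;

public class CreateDom {

  /** Creates a new in-memory document with the named root element, no namespace, and no doctype. */
  public static Document newDocument(String rootName) throws ParserConfigurationException {
    final DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    final DOMImplementation domImplementation = builder.getDOMImplementation();
    return domImplementation.createDocument(null, rootName, null);
  }

  public static void main(String[] args) throws ParserConfigurationException {
    final Document document = newDocument("config");
    // The root element already exists and is ready to receive attributes and children.
    System.out.println(document.getDocumentElement().getNodeName()); // prints "config"
  }
}
```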
Generating XML
As with generating byte representations of Java objects, producing a text or byte representation of an XML DOM instance is referred to as serialization.
XML Serialization using Transformer
Java provides XML serialization capabilities in the javax.xml.transform package. Use a javax.xml.transform.TransformerFactory to retrieve a javax.xml.transform.Transformer. Use Transformer.setOutputProperty(String name, String value) as needed to configure the transformer, using javax.xml.transform.OutputKeys values such as OutputKeys.ENCODING. Finally use Transformer.transform(Source xmlSource, Result outputTarget) to serialize the XML represented by a javax.xml.transform.dom.DOMSource to a javax.xml.transform.stream.StreamResult.
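The sequence could be sketched as follows. The class name and the serialize(…) helper are invented for illustration, and the example writes to a StringWriter so the result can be printed; to write a file you would wrap an OutputStream in the StreamResult instead.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TransformerSerialization {

  /** Serializes the given DOM document to a string of XML. */
  public static String serialize(Document document) throws TransformerException {
    // A transformer created without a stylesheet copies its source to its result unchanged.
    final Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    final StringWriter writer = new StringWriter();
    transformer.transform(new DOMSource(document), new StreamResult(writer));
    return writer.toString();
  }

  public static void main(String[] args) throws Exception {
    // Build a small document in memory to have something to serialize.
    final Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .getDOMImplementation().createDocument(null, "config", null);
    final Element user = document.createElement("user");
    user.appendChild(document.createTextNode("Jane Doe"));
    document.getDocumentElement().appendChild(user);
    System.out.println(serialize(document));
  }
}
```

Other OutputKeys values, such as OutputKeys.INDENT, can be set in the same way if more readable output is wanted.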
XML Serialization using LSSerializer
The DOM Level 3 Load and Save Specification provides a pure DOM approach for serializing XML. First access an org.w3c.dom.bootstrap.DOMImplementationRegistry to retrieve an org.w3c.dom.ls.DOMImplementationLS, which works as a factory to create the actual org.w3c.dom.ls.LSSerializer. You will also need to create a special org.w3c.dom.ls.LSOutput object to represent the output stream or writer.
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.*;

//look up a DOM implementation that supports the Load and Save ("LS") feature
final DOMImplementationRegistry domImplementationRegistry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS domImplementationLS = (DOMImplementationLS)domImplementationRegistry.getDOMImplementation("LS");
final LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
//optional: lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
final LSOutput lsOutput = domImplementationLS.createLSOutput();
lsOutput.setCharacterStream(new StringWriter()); //TODO write to actual file
lsOutput.setEncoding(StandardCharsets.UTF_8.name()); //request UTF-8 as the output encoding
lsSerializer.write(document, lsOutput); //document is the org.w3c.dom.Document to serialize
System.out.println(lsOutput.getCharacterStream().toString());
Review
Gotchas
- Don't confuse the XML terms valid and well-formed. If a document follows the rules of XML it is well-formed, but is considered valid only if it has been checked against a DTD or some other schema.
- HTML predefined entities such as &eacute; are not available if the document is parsed as XML, unless the entities are declared as required by XML. With Unicode support in modern text editors, there is little need to use these entities anyway; just enter the literal character you desire.
- Don't forget that even if an element appears to only have other elements as child nodes, text nodes may appear in the DOM representing whitespace such as line endings.
In the Real World
- XML has become ubiquitous within the world of computers; there are few languages and platforms that do not have robust XML parsers and other tools available. Recently its use has been supplanted somewhat by newer, simpler formats such as JSON, which you will learn about in a future lesson.
Think About It
- Attributes and child elements are two different ways to represent subordinate information in an XML document. Beyond small details (such as the fact that attributes cannot in turn have children), the semantic choice between the two is largely arbitrary when designing an XML-based format.
Self Evaluation
- What are some of the benefits of text file formats over binary ones?
- What is the syntax of a text format, and how does that differ from its semantics?
- What is the difference between well-formed and valid XML documents?
- How would you write an XML element that contains a child element yet generates no child text nodes in the parsed DOM tree?
Task
Upgrade the configuration file used to store the Booker application user name to XML format, using the following template:
<config>
	<user>Jane Doe</user>
</config>
- Use a file named config.xml in the .booker subdirectory of the user's home directory, replacing the user.txt file you were using in previous lessons.
- When the Booker application runs, check to see if there is a config.xml file in the configuration directory.
  - If there is no such file, generate one containing the current system user account name. Create a DOM tree manually in memory and then serialize it. You can retrieve the name of the current system logged-in user using System.getProperty(String key) with the key "user.name".
  - If that file exists, open and parse it to retrieve the name of the user.
- Print the user's name to the console (e.g. Library of Jane Doe) before printing the list of books.
See Also
- XML for the absolute beginner (JavaWorld)
- Namespaces Crash Course (Mozilla Developer Network)
- Lesson: Document Object Model (Oracle - The Java™ Tutorials)
- Reading XML Data into a DOM (Oracle - The Java™ Tutorials)
References
- Extensible Markup Language (XML) 1.0 (Fifth Edition) (W3C)
- Namespaces in XML 1.0 (Third Edition) (W3C)
- Recommended Doctype Declarations to use in your Web document (W3C)
- Document Object Model (DOM) Level 3 Core Specification (W3C)
- Document Object Model (DOM) Level 3 Load and Save Specification (W3C)
- DOM Standard (WHATWG)
Resources
- World Wide Web Consortium (W3C)
- Document Object Model (DOM)
- Web Hypertext Application Technology Working Group (WHATWG)
Acknowledgments
- DOM tree diagram modified from DOM model by Birger Eriksson (Own work) [CC BY-SA 3.0], via Wikimedia Commons.
- Some symbols are from Font Awesome by Dave Gandy.