Internet Protocols

Goal

Recognize the TCP/IP stack underlying Internet communication, and protocols that appear in each layer.
Understand IP addresses, ports, and URIs for resource identification.
Learn an overview of HTTP and experiment with simple HTTP connections.

Concepts

DNS server
domain name
Domain Name System (DNS)
HTTP body
HTTP header
HTTP method
HTTP request
HTTP response
HTTP response code
host
hostname
Hypertext Transfer Protocol (HTTP)
International Organization for Standardization (ISO)
Internationalized Resource Identifier (IRI)
Internet Engineering Task Force (IETF)
Internet protocol suite
IP address
IP address block
IP Version 4 (IPv4)
IP Version 6 (IPv6)
media type
Multipurpose Internet Mail Extensions (MIME)
network address translation (NAT)
network byte order
octet
Open Systems Interconnection Model (OSI Model)
packet
port
protocol stack
Request for Comments (RFC)
resource
stateless
Transmission Control Protocol and Internet Protocol (TCP/IP)
Uniform Resource Identifier (URI)
Uniform Resource Locator (URL)
Uniform Resource Name (URN)
URI scheme
URN namespace
user agent

Library

Lesson

Rarely nowadays is an application completely isolated from others. Almost every program has at least one feature that requires it to be connected to some larger network, even if only to get updates from the Internet. Knowledge of network communication is essential for modern programmers. Today's network communication uses the Internet protocol suite, most commonly using HTTP over TCP/IP, two protocols you'll learn more about in this and upcoming lessons.

TCP/IP

Just as you've been designing your software by separating concerns into layers, network communication is also conceptually divided into a set of layers called a protocol stack. The layers of a protocol stack help to separate communication into different responsibilities. For instance an email application can send mail using the high-level email protocol, without worrying about whether the computer is using a fixed cable or WiFi for a connection (which use different protocols for low-level signalling). The most common protocol stack used on the Internet is the Transmission Control Protocol and Internet Protocol (TCP/IP).

TCP/IP communication is performed by exchanging packets of information, each of which travels independently. Individual packets may take different routes and may even arrive at the destination out of order, to be reassembled correctly by the receiver. At each layer of communication, the packet of data from the next higher level is wrapped in a larger packet with that layer's header added. In this way data from higher levels can be sent without knowing or caring what the data actually contains. The four TCP/IP layers defined by RFC 1122 are the Application, Transport, Internet, and Link layers.

TCP/IP stack compared with OSI Model.

TCP/IP	Description	Packet	Examples	OSI Model
Application	Communication of actual user data.	Data	HTTP, FTP, SMTP	Application, Presentation, Session
Transport	Manages end-to-end communication, on the same or on different computers; indicates port.	UDP Header + Data	TCP, UDP	Transport
Internet	Manages routing packets across network boundaries; indicates IP address.	IP Header + UDP Header + Data		Network
Link	Navigates the specific protocol within each network type the packet passes through.	Frame Header + IP Header + UDP Header + Data + Frame Footer	ARP, PPP	Data Link
			Ethernet	Physical

Packet diagram from Wikipedia.
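The wrapping of packets at each layer can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this example, the bracketed text labels stand in for real binary headers, and the frame footer is left out for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: real headers are binary structures, not text labels. */
public class EncapsulationSketch {

  /** Wraps a payload by placing a (hypothetical) header in front of it. */
  static byte[] wrap(final String header, final byte[] payload) {
    final byte[] headerBytes = header.getBytes(StandardCharsets.US_ASCII);
    final byte[] packet = Arrays.copyOf(headerBytes, headerBytes.length + payload.length);
    System.arraycopy(payload, 0, packet, headerBytes.length, payload.length);
    return packet;
  }

  /** Passes application data down through the Transport, Internet, and Link layers. */
  static byte[] encapsulate(final byte[] data) {
    final byte[] datagram = wrap("[UDP]", data); //Transport layer
    final byte[] ipPacket = wrap("[IP]", datagram); //Internet layer
    return wrap("[FRAME]", ipPacket); //Link layer
  }

  public static void main(final String[] args) {
    final byte[] frame = encapsulate("Data".getBytes(StandardCharsets.US_ASCII));
    System.out.println(new String(frame, StandardCharsets.US_ASCII)); //prints [FRAME][IP][UDP]Data
  }
}
```

Note that no layer inspects the payload it was handed; each simply treats it as opaque bytes.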

IP Address

To communicate using TCP/IP, both the sender and receiver (referred to as hosts) must have a unique IP address, which is managed by the Internet Layer. For most of the history of the Internet until now, IP addresses have been IP Version 4 (IPv4) addresses: 32-bit values that are usually presented as four groups of decimal values separated by full stop (period) characters, for example 198.51.100.27.

Because of the incredible growth of the Internet, especially recently with multiple smaller devices connected, the supply of available IPv4 addresses quickly ran out. To address this problem the Internet is (very) slowly migrating to the use of IP Version 6 (IPv6) addresses, which contain 128 bits and are written as eight groups of hexadecimal values separated by colon characters, for example 2001:0db8:fe09:0000:0000:0000:0000:001b. When representing IPv6 addresses, leading zeros in each group can be removed, and one run of consecutive all-zero groups can be replaced by two colon characters, e.g. 2001:db8:fe09::1b.

The special hostname localhost refers to this computer, and can be used on any computer to refer to the host itself. The special IP addresses 127.0.0.1 (IPv4) and ::1 (IPv6) are usually defined to represent localhost.
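To experiment with these address formats, the following minimal sketch uses the standard java.net.InetAddress class. The class name IpAddressSketch and its helper method are made up for this example. When given a literal IP address, InetAddress.getByName(…) only parses the text and performs no lookup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class IpAddressSketch {

  /** Returns the number of bits in the address represented by the given literal. */
  static int bitLength(final String literal) throws UnknownHostException {
    return InetAddress.getByName(literal).getAddress().length * Byte.SIZE;
  }

  public static void main(final String[] args) throws UnknownHostException {
    System.out.println(bitLength("198.51.100.27")); //32
    System.out.println(bitLength("2001:db8:fe09::1b")); //128

    //the full and shortened forms denote the same IPv6 address
    final InetAddress full = InetAddress.getByName("2001:0db8:fe09:0000:0000:0000:0000:001b");
    final InetAddress shortened = InetAddress.getByName("2001:db8:fe09::1b");
    System.out.println(full.equals(shortened)); //true

    //both special addresses are recognized as loopback addresses
    System.out.println(InetAddress.getByName("127.0.0.1").isLoopbackAddress()); //true
    System.out.println(InetAddress.getByName("::1").isLoopbackAddress()); //true
  }
}
```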

Port

TCP/IP does not limit communication to a single connection between hosts. Each host can have multiple communication channels open between other hosts or even between applications on the same host. The endpoint of these communication links is a port on the host, identified by a 16-bit number, and managed by the Transport Layer. Although many ports are free for any application to use, some ports are defined to be used with specific protocols. For example the HTTP protocol, discussed below, by default uses port 80.
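As a rough illustration of ports as connection endpoints, the sketch below opens a TCP connection between two sockets on the same host using the loopback address. The class and method names are made up for this example, and it assumes the operating system permits loopback connections. Passing port 0 to ServerSocket asks the operating system to choose any free port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class PortSketch {

  /** Connects two sockets over loopback and checks that each end sees the other's port. */
  static boolean connectOverLoopback() throws IOException {
    final InetAddress loopback = InetAddress.getLoopbackAddress();
    try(final ServerSocket server = new ServerSocket(0, 1, loopback)) { //0 means "any free port"
      final int serverPort = server.getLocalPort();
      try(final Socket client = new Socket(loopback, serverPort);
          final Socket accepted = server.accept()) {
        System.out.println("server port: " + serverPort);
        System.out.println("client port: " + client.getLocalPort());
        //the remote port of one end is the local port of the other end
        return client.getPort() == serverPort && accepted.getPort() == client.getLocalPort();
      }
    }
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(connectOverLoopback());
  }
}
```

The actual port numbers printed will vary from run to run, but both will fit in the 16-bit range of 0 to 65535.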

DNS

As you learned during the early lesson on indirection, the Internet Domain Name System (DNS) provides a series of hierarchical names of machines across the Internet, such as www.example.com. Although TCP/IP packets are ultimately routed using IP addresses, domain names provide a practical way for humans to exchange addresses, as well as a flexible level of indirection. A separate computer called a DNS server contains a mapping of domain names to their corresponding IP addresses.

Conceptual mapping of domain names to IP addresses on a DNS server.

Domain Name	IP Address
`www.myserver.com`	`1.2.3.4`
`www.example.com`	`xxx.xxx.xxx.xxx`

When you ask your computer to browse to http://www.myserver.com:

Your computer first goes out to a DNS server and asks for the IP address of www.myserver.com.
The DNS server responds that the IP address of www.myserver.com is 1.2.3.4.
Your computer then makes a connection directly to the computer with IP address 1.2.3.4 (the computer running the web site of www.myserver.com).

DNS lookup diagram for www.myserver.com resolving to IP address 1.2.3.4. — Simplified DNS lookup (Dyn).

This extra layer of indirection has two main benefits:

It allows you to specify web sites in terms of easy-to-remember domain names instead of cryptic IP addresses.
If a web site (e.g. www.myserver.com) decides to move its content from one server (e.g. with IP address 1.2.3.4) to another (e.g. with IP address 5.6.7.8), nothing has to change on your computer. Only the table on the DNS server needs to be updated.
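From Java you can ask the system resolver, which typically consults a DNS server, for the addresses of a hostname using InetAddress.getAllByName(…). The sketch below is a minimal example; the class name DnsLookupSketch is made up, and the default hostname localhost is merely an assumption so that the program runs without Internet access.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupSketch {

  /** Asks the system resolver for all addresses known for the given hostname. */
  static InetAddress[] lookup(final String hostname) throws UnknownHostException {
    return InetAddress.getAllByName(hostname);
  }

  public static void main(final String[] args) throws UnknownHostException {
    //pass a hostname such as www.example.com as the first argument to look it up instead
    final String hostname = args.length > 0 ? args[0] : "localhost";
    for(final InetAddress address : lookup(hostname)) {
      System.out.println(hostname + " -> " + address.getHostAddress());
    }
  }
}
```

The addresses printed depend on the resolver configuration of the machine running the program.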

URI

The basis for identifying resources (the general term for identifiable items, including web pages or images) on the Internet is the Uniform Resource Identifier (URI), defined most recently by RFC 3986. URIs begin with a scheme and a colon : character. A common URI scheme is http, found in web addresses.

There are two types of URIs: the Uniform Resource Locator (URL) and the Uniform Resource Name (URN). A URN uses the scheme urn and is followed by a URN-specific namespace indicating some formal identification scheme, followed by another colon : character. For example the URN urn:isbn:9780486275437, which uses the isbn namespace for ISBN book identifiers, identifies the Dover Thrift Edition of the book Alice's Adventures in Wonderland.

All URIs that are not URNs are considered URLs. Many URLs contain a hierarchical part indicating how to locate a resource. A web address for example contains a domain name or IP address, an optional port separated by a colon : character, followed by a path. For example http://www.example.com:8080/foo/bar.txt is a typical format of a web address URL, which is also considered in general terms a URI.

scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └───┬───┘ └──┬──┘
scheme  user information     host     port            query   fragment

  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path

Generic URI components. (Wikipedia)

The optional query part of a URI indicates some qualification of how the resource is being requested. It consists of name=value parameter pairs separated from the rest of the URI by the question mark ? character. Multiple query parameters are separated by ampersand & characters, such as http://www.example.com/numbers?min=90&limit=5.

Java comes with a class java.net.URI for representing a URI. The class comes with many ways to create an instance of a URI, including constructors and static factory methods. The simplest way to create a URI from a string you already know to contain a valid URI is to call URI.create(String str), which throws an IllegalArgumentException if the string is not in valid URI format. You can also resolve a path string to a URI instance, similar to resolving a path string to a Path instance for the file system, using URI.resolve(String str).

final URI exampleBaseURI = URI.create("http://www.example.com/");
final URI indexURI = exampleBaseURI.resolve("index.html");  //yields http://www.example.com/index.html
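The java.net.URI class also provides accessors for the individual URI components described earlier. The following sketch, in which the class name UriComponentsSketch and its helper method are made up for illustration, breaks apart the generic example URI and shows how a URN differs.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class UriComponentsSketch {

  /** Collects the components of a URI, in order, using the standard accessors. */
  static Map<String, Object> components(final URI uri) {
    final Map<String, Object> components = new LinkedHashMap<>();
    components.put("scheme", uri.getScheme());
    components.put("userInfo", uri.getUserInfo());
    components.put("host", uri.getHost());
    components.put("port", uri.getPort()); //-1 if the URI does not specify a port
    components.put("path", uri.getPath());
    components.put("query", uri.getQuery());
    components.put("fragment", uri.getFragment());
    return components;
  }

  public static void main(final String[] args) {
    final URI uri = URI.create("abc://username:password@example.com:123/path/data?key=value#fragid1");
    components(uri).forEach((name, value) -> System.out.println(name + ": " + value));

    //a URN has no hierarchical part; Java calls such a URI "opaque"
    final URI urn = URI.create("urn:example:mammal:monotreme:echidna");
    System.out.println(urn.isOpaque()); //true
    System.out.println(urn.getSchemeSpecificPart()); //example:mammal:monotreme:echidna
  }
}
```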

Two potentially confusing terms related to URIs are absolute and relative. You've already seen these terms when discussing paths on a file system. Similarly the path component of a URI, depending on the URI scheme, may be relative or absolute. You may also read documentation (including that for the java.net.URI class) that refers to an absolute or relative URI. The Javadocs for URI indicate that an absolute URI contains a scheme, while relative URIs may be resolved against absolute URIs. However there is no such thing as a relative URI! All URIs are absolute in the sense that they contain a scheme. What Java refers to as a relative URI is merely a path or other reference that can be resolved against an existing URI, sometimes called a URI reference. The Java URI class might therefore be better named URIReference because it can represent URIs or relative references.

Because a java.net.URI can contain URIs or references, you can also resolve a URI instance against another using URI.resolve(URI uri).
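A short sketch may make the distinction concrete. The class name and the example URIs here are made up for illustration; the methods are those of java.net.URI.

```java
import java.net.URI;

public class UriReferenceSketch {

  static final URI BASE = URI.create("http://www.example.com/foo/");

  /** A reference with no scheme: what the Java documentation calls a "relative URI". */
  static final URI REFERENCE = URI.create("bar.txt");

  public static void main(final String[] args) {
    System.out.println(BASE.isAbsolute()); //true
    System.out.println(REFERENCE.isAbsolute()); //false

    final URI resolved = BASE.resolve(REFERENCE);
    System.out.println(resolved); //http://www.example.com/foo/bar.txt
    System.out.println(resolved.isAbsolute()); //true

    //relativize() goes the other way, producing a reference relative to the base
    System.out.println(BASE.relativize(resolved)); //bar.txt
  }
}
```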

URI Encoding

The characters that may appear in a URI are limited to a subset of ASCII characters, and some of them even have special meaning (e.g. a slash / character is used to separate path segments). Those characters that are outside the ASCII range or that are restricted must be encoded if they are to be included in a URI. RFC 3986 specifies that characters may be encoded in a URI by using the following algorithm:

Encode the character in UTF-8.
Encode each resulting byte in two uppercase hexadecimal digits preceded by a percent % sign (e.g. %0F).

Thus the word touché could be encoded in a URI relative to http://example.com/ as such:

http://example.com/touch%C3%A9

The java.net.URI class allows characters not allowed in URIs! If you are uncertain whether the URIs you are using are properly encoded according to RFC 3986, you can use the URI.toASCIIString() method to encode any non-ASCII characters.
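The two-step encoding algorithm is simple enough to sketch by hand, which can then be compared against URI.toASCIIString(). In the following example the class name and the percentEncode(…) helper are made up for illustration; the helper encodes every byte, whereas a real encoder would leave permitted ASCII characters untouched.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class UriEncodingSketch {

  /** Encodes the text in UTF-8, then writes each byte as % followed by two uppercase hex digits. */
  static String percentEncode(final String text) {
    final StringBuilder encoded = new StringBuilder();
    for(final byte b : text.getBytes(StandardCharsets.UTF_8)) {
      encoded.append(String.format("%%%02X", b & 0xFF));
    }
    return encoded.toString();
  }

  public static void main(final String[] args) {
    System.out.println(percentEncode("\u00E9")); //%C3%A9 (\u00E9 is the character é)

    //URI.create() accepts the non-ASCII character; toASCIIString() encodes it
    final URI uri = URI.create("http://example.com/touch\u00E9");
    System.out.println(uri.toASCIIString()); //http://example.com/touch%C3%A9
  }
}
```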

URL

By far the most common type of URI is the URL. Here are some URL schemes you are likely to run into:

Scheme	Description	Default Port	Example
`file`	File on a file system. The `file` scheme has many inconsistencies, especially on Windows when used from Java.	N/A	`file:///usr/local/foo/bar.txt`
`ftp`	Resource accessible via FTP.	`21`	`ftp://ftp.example.com/foo/bar.txt`
`http`	Resource accessible via HTTP.	`80`	`http://www.example.com/foo/bar.txt`
`https`	Resource accessible via HTTP over SSL/TLS.	`443`	`https://www.example.com/foo/bar.txt`
`mailto`	Email address	varies	`mailto:jdoe@example.com`
`tel`	Telephone number.	N/A	`tel:+1-415-555-0123`

You may see some API methods referring to URL encoding, but this is not the same as URI encoding! URL encoding (using the media type application/x-www-form-urlencoded, which you will learn about in a future lesson) is an older approach used when submitting HTML forms on the web. Notably URL encoding uses a plus + sign rather than %20 to encode spaces.
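The difference is easy to see side by side. The sketch below compares java.net.URLEncoder, which performs URL (form) encoding, with one of the multi-argument URI constructors, which quote illegal characters as RFC 3986 percent escapes. The class and method names are made up for this example, and the URLEncoder.encode(String, Charset) overload assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormEncodingSketch {

  /** URL (form) encoding: intended for HTML form data, not for URI paths. */
  static String formEncode(final String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8); //Java 10+
  }

  /** URI encoding: the multi-argument URI constructors quote illegal characters in the path. */
  static String uriWithPath(final String path) throws URISyntaxException {
    return new URI("http", "example.com", path, null).toASCIIString();
  }

  public static void main(final String[] args) throws URISyntaxException {
    System.out.println(formEncode("Alice in Wonderland")); //Alice+in+Wonderland
    System.out.println(uriWithPath("/Alice in Wonderland")); //http://example.com/Alice%20in%20Wonderland
  }
}
```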

The java.net.URL class predates the java.net.URI class and represents not only a URL endpoint, but also several implementations of how to retrieve data from the URL.

As the Path interface supersedes the File class by representing the path itself, a URI instance should be used in most cases rather than a URL instance. The URI class provides more flexibility in manipulating URIs; supports later RFC specifications related to URIs; and doesn't get encumbered with the actual logic needed to access a URI. You will nevertheless need to convert the URI instance to a URL instance if you need to create a URLConnection (see below).

Media Types

The data you can transfer across the Internet comes in various flavors. An HTTP GET request could retrieve a plain text file, an HTML file, an audio clip, or an image. Rather than relying on a filename extension, Internet architecture uses a more robust mechanism called a media type for determining the type of an entity. Most recently specified in RFC 6838, media types have a formal IETF registration process and provide fixed, descriptive identifiers for media content.

A media type has a specially formatted identifier consisting of a type and subtype. Some common media types include:

text/plain
text/html
application/json
application/xml
image/jpeg
image/png
audio/mpeg

A media type value may be followed by additional parameters, each separated from the media type by a semicolon ; character. The most common media type parameter is charset, which uses the equals = character to indicate the charset of the content. The following media type value would indicate a type of plain text with a UTF-8 charset:

text/plain; charset=UTF-8
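Pulling such a value apart takes only a little string handling. The following is a minimal sketch with made-up class and method names; it assumes simple, unquoted parameter values and is not a complete parser for everything the specifications allow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MediaTypeSketch {

  /** Returns the type/subtype portion before any parameters, e.g. text/plain. */
  static String typeAndSubtype(final String value) {
    final int semicolon = value.indexOf(';');
    return (semicolon >= 0 ? value.substring(0, semicolon) : value).trim();
  }

  /** Returns the name=value parameters following the type/subtype, in order. */
  static Map<String, String> parameters(final String value) {
    final Map<String, String> parameters = new LinkedHashMap<>();
    final String[] parts = value.split(";");
    for(int i = 1; i < parts.length; i++) { //skip the type/subtype at index 0
      final String[] nameValue = parts[i].split("=", 2);
      if(nameValue.length == 2) {
        parameters.put(nameValue[0].trim(), nameValue[1].trim());
      }
    }
    return parameters;
  }

  public static void main(final String[] args) {
    final String value = "text/plain; charset=UTF-8";
    System.out.println(typeAndSubtype(value)); //text/plain
    System.out.println(parameters(value).get("charset")); //UTF-8
  }
}
```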

HTTP

The Hypertext Transfer Protocol (HTTP) is an application-layer protocol in the TCP/IP stack. It is the basis of browsing the World Wide Web, as well as the foundation for many new Internet-based APIs. HTTP has almost become synonymous with the Internet.

Communication with HTTP is conceptually very simple, which is one reason for its ubiquity. An HTTP command consists of a request and a response. Each HTTP request includes a method (or verb) indicating an action to perform. The response indicates the result of the request.

Request: verb resource-URI [content] (Do something with the identified resource using the given content, if any.)
Response: response-code [content] (Here is the outcome of the request, with content if appropriate.)

The most commonly used HTTP method is GET, which indicates that the user agent (such as a web browser) wants to retrieve a resource (such as a web page). The response code 200 indicates that everything went OK; the requested web page will be returned in the body of the response. At the beginning of each request and response is a series of headers (names and values, separated by a colon : character), which provide more information about the message. The conversation may look like this when retrieving the http://www.example.com/index.html home page:

Example HTTP Request (Wikipedia)

GET /index.html HTTP/1.1
Host: www.example.com

Example HTTP Response (Wikipedia)

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
ETag: "3f80f-1b6-3e1cb03b"
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Accept-Ranges: bytes
Connection: close

<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World, this is a very simple HTML document.
</body>
</html>

Most successful HTTP requests return a response code of 200 meaning OK. If a resource cannot be located, HTTP will return the response code 404 indicating Not Found.
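Because HTTP/1.1 messages are plain text, their structure can be explored without any networking at all. The following sketch, with made-up class and method names, builds the text of a minimal GET request like the one in the example and extracts the response code from a status line. It ignores any query part of the URI and assumes a non-empty path.

```java
import java.net.URI;

public class HttpMessageSketch {

  /** Builds a minimal HTTP/1.1 GET request; each line ends with CR LF and a blank line ends the headers. */
  static String getRequest(final URI uri) {
    return "GET " + uri.getRawPath() + " HTTP/1.1\r\n"
        + "Host: " + uri.getHost() + "\r\n"
        + "\r\n";
  }

  /** Extracts the numeric response code from a status line such as HTTP/1.1 200 OK. */
  static int responseCode(final String statusLine) {
    final String[] parts = statusLine.split(" ", 3); //version, code, reason phrase
    return Integer.parseInt(parts[1]);
  }

  public static void main(final String[] args) {
    System.out.print(getRequest(URI.create("http://www.example.com/index.html")));
    System.out.println(responseCode("HTTP/1.1 200 OK")); //200
    System.out.println(responseCode("HTTP/1.1 404 Not Found")); //404
  }
}
```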

`URLConnection`

Java provides an abstract class java.net.URLConnection, which allows reading from and writing to a URL location. Most important in the current context is the abstract subclass java.net.HttpURLConnection, which provides special access to HTTP-specific information. You can specify the HTTP method to use with the HttpURLConnection.setRequestMethod(String method) method, and then ask for the response code sent back in the response by calling HttpURLConnection.getResponseCode().

You can get an instance of URLConnection by calling URL.openConnection(). If the URL you provided has one of the HTTP schemes, you can be sure that the returned value will be an HttpURLConnection:

final URI exampleURI = URI.create("http://www.example.com/index.html");
final HttpURLConnection connection = (HttpURLConnection)exampleURI.toURL().openConnection();
connection.setRequestMethod("GET");	//included for clarity; GET is already the default
final int responseCode = connection.getResponseCode(); //TODO check response code
try(final InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
  //TODO read from the input stream
}

The URL class also provides a way to directly open an input stream to the URL it represents, but using a URLConnection provides much more flexibility.
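One possible way to fill in the parts left open in the HttpURLConnection snippet is sketched below. The class and method names are made up for this example, the code assumes the response body is UTF-8 text, and InputStream.readAllBytes() assumes Java 9 or later. Running main(…) requires Internet access.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class FetchSketch {

  /** Reads an entire stream as text, assuming the content is encoded in UTF-8. */
  static String readText(final InputStream inputStream) throws IOException {
    return new String(inputStream.readAllBytes(), StandardCharsets.UTF_8); //Java 9+
  }

  /** Retrieves a resource using HTTP GET, failing unless the response code is 200 (OK). */
  static String fetch(final URI uri) throws IOException {
    final HttpURLConnection connection = (HttpURLConnection)uri.toURL().openConnection();
    try {
      final int responseCode = connection.getResponseCode();
      if(responseCode != HttpURLConnection.HTTP_OK) {
        throw new IOException("Unexpected response code " + responseCode + " for " + uri);
      }
      try(final InputStream inputStream = new BufferedInputStream(connection.getInputStream())) {
        return readText(inputStream);
      }
    } finally {
      connection.disconnect();
    }
  }

  public static void main(final String[] args) throws IOException {
    System.out.println(fetch(URI.create("http://www.example.com/index.html")));
  }
}
```

A more careful implementation would take the charset from the Content-Type header of the response rather than assuming UTF-8.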

Review

Gotchas

TODO talk about absolute/relative in relation to URIs
TODO talk about URI constructors and encoded/raw
TODO talk about URI.toString()
TODO talk about File URIs
Do not use URL encoding (which uses the plus + sign to encode spaces) when you are encoding non-ASCII characters in URIs. Only use URL encoding for HTML form submission data.

In the Real World

Almost every URI you will encounter will be a URL. You will seldom see an actual URN, but you should know that they exist.

Think About It

TODO

Self Evaluation

What is the OSI model and how does it relate to TCP/IP?
What is the difference between a URI and a URL?
What does it mean that HTTP is stateless?
Is there such a thing as a relative URI?
Is it possible for an HTTP URI to have a relative path?

Task

The Google Books API is an HTTP-based interface for accessing the Google Books repository. For this homework you will query a URI in the following format, using version 1 of the API:

https://www.googleapis.com/books/v1/volumes?q=isbn:isbn

isbn: The requested ISBN.

Add a new info command-line option to the Booker program to provide information on a particular book. Add a new --lookup flag to the Booker program, to be used in conjunction with the list command, indicating that information should be retrieved from the Internet. If the flag is present, retrieve and print information from the Google Books API.

Use the provided ISBN to retrieve information from Google Books API. Print out whatever results are returned.
Store the Google Books API v1 base URI in a separate constant, and use URI.resolve(…) to produce the full query URI (perhaps storing it in a separate constant).
Use a URLConnection with the HTTP GET method.
If the response code is 200 (OK), assume the response content is text encoded in UTF-8. Otherwise, present an error.

Example usage: booker list --isbn 9780486275437 --lookup

booker list [--debug] [--locale <locale>] [--isbn <ISBN> | --issn <ISSN>] [--name <name>] [--type (book|periodical)] [--lookup]
booker load-snapshot [--debug] [--locale <locale>]
booker purchase --isbn <ISBN> [--debug] [--locale <locale>]
booker subscribe --issn <ISSN> [--debug] [--locale <locale>]
booker -h | --help

Option	Alias	Description
`list`		Lists all available publications.
`load-snapshot`		Loads the snapshot list of publications into the current repository.
`purchase`		Removes a single copy of the book identified by ISBN from stock.
`subscribe`		Subscribes to a year's worth of issues of the periodical identified by ISSN.
`--debug`	`-d`	Includes debug information in the logs.
`--help`	`-h`	Prints out a help summary of available switches.
`--isbn`		Identifies a book, for example for the `purchase` command.
`--issn`		Identifies a periodical, for example for the `subscribe` command.
`--locale`	`-l`	Indicates the locale to use in the program, overriding the system default. The value is in language tag format.
`--lookup`		Retrieves from the Internet information on a book identified by its ISBN.
`--name`	`-n`	Indicates a filter by name for the `list` command.
`--type`	`-t`	Indicates the type of publication to list, either book or periodical. If not present, all publications will be listed.

References

Resources

Acknowledgments

URI Euler Diagram by David Torres (original author); derivative work: Qwerty0 (URI_VENN_DIAGRAM.SVG) [CC BY-SA 3.0 or GFDL], via Wikimedia Commons.
URI syntax diagrams from Wikipedia, under the Creative Commons Attribution-ShareAlike License.
DNS lookup diagram from blog post by Liviu Tudor on the web site of Dyn, a company that provides DNS services. Used with permission from EMEA Marketing Coordinator on 2015-11-12.
