Internet Protocols
Goal
- Recognize the TCP/IP stack underlying Internet communication, and protocols that appear in each layer.
- Understand IP addresses, ports, and URIs for resource identification.
- Learn an overview of HTTP and experiment with simple HTTP connections.
Concepts
- DNS server
- domain name
- Domain Name System (DNS)
- HTTP body
- HTTP header
- HTTP method
- HTTP request
- HTTP response
- HTTP response code
- host
- hostname
- Hypertext Transfer Protocol (HTTP)
- International Organization for Standardization (ISO)
- Internationalized Resource Identifier (IRI)
- Internet Engineering Task Force (IETF)
- Internet protocol suite
- IP address
- IP address block
- IP Version 4 (IPv4)
- IP Version 6 (IPv6)
- media type
- Multipurpose Internet Mail Extension (MIME)
- network address translation (NAT)
- network byte order
- octet
- Open System Interconnection Model (OSI Model)
- packet
- port
- protocol stack
- Request for Comment (RFC)
- resource
- stateless
- Transmission Control Protocol and Internet Protocol (TCP/IP)
- Uniform Resource Identifier (URI)
- Uniform Resource Locator (URL)
- Uniform Resource Name (URN)
- URI scheme
- URN namespace
- user agent
Library
java.net.HttpURLConnection
java.net.HttpURLConnection.HTTP_NOT_FOUND
java.net.HttpURLConnection.HTTP_OK
java.net.HttpURLConnection.getResponseCode()
java.net.HttpURLConnection.setRequestMethod(String method)
java.net.MalformedURLException
java.net.URI
java.net.URI.create(String str)
java.net.URI.resolve(String str)
java.net.URI.resolve(URI uri)
java.net.URI.toASCIIString()
java.net.URI.toURL()
java.net.URL
java.net.URL.openConnection()
java.net.URLConnection
Lesson
Rarely nowadays is an application completely isolated from others. Almost every program has at least one feature that requires it to be connected to some larger network, even if only to get updates from the Internet. Knowledge of network communication is essential for modern programmers. Today's network communication uses the Internet protocol suite, most commonly using HTTP over TCP/IP, two protocols you'll learn more about in this and upcoming lessons.
TCP/IP
Just as you've been designing your software by separating concerns into layers, network communication is also conceptually divided into a set of layers called a protocol stack. The layers of a protocol stack help to separate communication into different responsibilities. For instance, an email application can send mail using the high-level email protocol without worrying about whether the computer is using a fixed cable or WiFi for a connection (which use different protocols for low-level signalling). The most common protocol stack used on the Internet is the Transmission Control Protocol and Internet Protocol (TCP/IP).
TCP/IP communication is performed by exchanging packets of information, each of which travels independently. Individual packets may take different routes and may even arrive at the destination out of order, to be reassembled correctly by the receiver. At each layer of communication, the packet of data from the next higher level is wrapped in a larger packet with a header added in front. In this way data from higher levels can be sent without knowing or caring what the data actually contains. The four TCP/IP layers defined by RFC 1122 are the Application, Transport, Internet, and Link layers.
TCP/IP | Description | Examples | OSI Model |
---|---|---|---|
Application | Communication of actual user data. | HTTP, FTP, SMTP | Application, Presentation, Session |
Transport | Manages end-to-end communication, on the same or on different computers; indicates port. | TCP, UDP | Transport |
Internet | Manages routing packets across network boundaries; indicates IP address. | IP | Network |
Link | Navigates the specific protocol within each network type the packet passes through. | ARP, PPP | Data Link |
 | | Ethernet | Physical |
Packet diagram from Wikipedia.
IP Address
To communicate using TCP/IP, the sender and receiver (referred to as hosts) must each have a unique IP address, which is managed by the Internet Layer. For most of the history of the Internet, IP addresses have been IP Version 4 (IPv4) addresses: 32-bit values that are usually presented as four groups of decimal values separated by full stop (period) characters, for example 198.51.100.27.
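To make the relationship between the dotted form and the underlying 32-bit value concrete, here is a minimal sketch using the standard java.net.InetAddress class. The class name Ipv4Octets and its helper methods are made up for illustration; the sketch assumes the input is an IPv4 literal, in which case no DNS lookup takes place.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;

/** Sketch: view a dotted-decimal IPv4 address as four octets and as one 32-bit value. */
final class Ipv4Octets {

  /** Parses an IP address literal into its octets; a literal is only validated, not looked up. */
  static byte[] octets(final String ipv4Literal) {
    try {
      return InetAddress.getByName(ipv4Literal).getAddress();
    } catch (final UnknownHostException e) {
      throw new IllegalArgumentException("Not a valid IP address literal: " + ipv4Literal, e);
    }
  }

  /** Combines the four octets in network byte order (big-endian) and formats them as hex. */
  static String toHex(final String ipv4Literal) {
    // ByteBuffer reads big-endian by default, which matches network byte order.
    final int value = ByteBuffer.wrap(octets(ipv4Literal)).getInt();
    return String.format("%08x", value);
  }

  public static void main(final String[] args) {
    System.out.println(octets("198.51.100.27").length); // 4 octets
    System.out.println(toHex("198.51.100.27")); // c633641b
  }
}
```

Each decimal group is one octet: 198 is c6, 51 is 33, 100 is 64, and 27 is 1b.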
Because of the incredible growth of the Internet, especially recently with multiple smaller devices connected, the supply of available IPv4 addresses quickly ran out. To address this problem the Internet is (very) slowly migrating to the use of IP Version 6 (IPv6) addresses, which contain 128 bits and are presented as eight groups of hexadecimal values separated by colon characters, for example 2001:0db8:fe09:0000:0000:0000:0000:001b. When representing IPv6 addresses, leading zeros can be removed, and one run of consecutive all-zero groups can be replaced by two colon characters, e.g. 2001:db8:fe09::1b.
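As a quick check that the long and shortened spellings denote the same address, the following sketch parses both with java.net.InetAddress and compares the raw octets. The class name Ipv6Forms is hypothetical, and the sketch assumes both inputs are IPv6 literals so that no name lookup is needed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;

/** Sketch: the full and the abbreviated IPv6 notation identify the same 128-bit address. */
final class Ipv6Forms {

  /** Parses an IPv6 literal into its 16 octets. */
  static byte[] octets(final String ipv6Literal) {
    try {
      return InetAddress.getByName(ipv6Literal).getAddress();
    } catch (final UnknownHostException e) {
      throw new IllegalArgumentException("Not a valid IP address literal: " + ipv6Literal, e);
    }
  }

  /** Returns whether two literals denote the same address, octet for octet. */
  static boolean sameAddress(final String first, final String second) {
    return Arrays.equals(octets(first), octets(second));
  }

  public static void main(final String[] args) {
    System.out.println(octets("2001:db8:fe09::1b").length); // 16 octets, i.e. 128 bits
    System.out.println(sameAddress("2001:0db8:fe09:0000:0000:0000:0000:001b", "2001:db8:fe09::1b")); // true
  }
}
```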
Port
TCP/IP does not limit communication to a single connection between hosts. Each host can have multiple communication channels open between other hosts or even between applications on the same host. The endpoint of these communication links is a port on the host, identified by a 16-bit number, and managed by the Transport Layer. Although many ports are free for any application to use, some ports are defined to be used with specific protocols. For example the HTTP protocol, discussed below, by default uses port 80.
DNS
As you learned during the early lesson on indirection, the Internet Domain Name System (DNS) provides a series of hierarchical names of machines across the Internet, such as www.example.com. Although TCP/IP packets are ultimately routed using IP addresses, domain names provide a practical way for humans to exchange addresses as well as a flexible level of indirection. A separate computer called a DNS server contains a mapping of domain names and their corresponding IP addresses.
Domain Name | IP Address |
---|---|
www.myserver.com | 1.2.3.4 |
www.example.com | xxx.xxx.xxx.xxx |
When you ask your computer to browse to http://www.myserver.com:
- Your computer first goes out to a DNS server and asks for the IP address of www.myserver.com.
- The DNS server responds that the IP address of www.myserver.com is 1.2.3.4.
- Your computer then makes a connection directly to the computer with IP address 1.2.3.4 (the computer running the web site of www.myserver.com).
This extra layer of indirection has two main benefits:
- It allows you to specify web sites in terms of easy-to-remember domain names instead of cryptic IP addresses.
- If a web site (e.g. www.myserver.com) decides to move its content from one server (e.g. with IP address 1.2.3.4) to another (e.g. with IP address 5.6.7.8), nothing has to change on your computer. Only the table on the DNS server needs to be updated.
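The lookup step described above can be tried from Java with java.net.InetAddress, which asks the operating system's resolver for the address behind a name. This is only a sketch: the class name DnsLookup is invented here, the result for a real domain depends on your network and on the current DNS records, and a string that is already an IP address is returned without any lookup.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Sketch: ask the system resolver for the IP address behind a hostname. */
final class DnsLookup {

  /** Returns the textual IP address for a hostname, or the input itself if it is an IP literal. */
  static String lookup(final String hostname) {
    try {
      return InetAddress.getByName(hostname).getHostAddress();
    } catch (final UnknownHostException e) {
      throw new UncheckedIOException("No address found for " + hostname, e);
    }
  }

  public static void main(final String[] args) {
    // Requires network access; the printed address depends on the current DNS records.
    System.out.println(lookup(args.length > 0 ? args[0] : "www.example.com"));
  }
}
```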
URI
The basis for identifying resources (the general term for identifiable items, including web pages or images) on the Internet is the Uniform Resource Identifier (URI), defined most recently by RFC 3986. URIs begin with a scheme and a colon : character. A common URI scheme is http, found in web addresses.
There are two types of URIs: the Uniform Resource Locator (URL) and the Uniform Resource Name (URN). A URN uses the scheme urn and is followed by a URN-specific namespace indicating some formal identification scheme, followed by another colon : character. For example, there is a URN namespace for ISBN book identifiers: the URN urn:isbn:9780486275437 identifies the Dover Thrift Edition of the book Alice's Adventures in Wonderland.
All URIs that are not URNs are considered URLs. Many URLs contain a hierarchical part indicating how to locate a resource. A web address, for example, contains a domain name or IP address, an optional port separated by a colon : character, followed by a path. For example http://www.example.com:8080/foo/bar.txt is a typical format of a web address URL, which is also considered in general terms a URI.
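To see these parts in practice, the sketch below takes the two example URIs from this section apart using accessor methods of java.net.URI. The class name UriParts and the output format are made up for illustration.

```java
import java.net.URI;

/** Sketch: inspect the components of a hierarchical URL and of an opaque URN. */
final class UriParts {

  /** Summarizes a URI as "scheme | host | port | path", or "scheme | rest" if it is opaque. */
  static String describe(final URI uri) {
    if (uri.isOpaque()) { // e.g. a URN: no slash after the scheme, so no host, port, or path
      return uri.getScheme() + " | " + uri.getSchemeSpecificPart();
    }
    return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPort() + " | " + uri.getPath();
  }

  public static void main(final String[] args) {
    System.out.println(describe(URI.create("http://www.example.com:8080/foo/bar.txt")));
    // http | www.example.com | 8080 | /foo/bar.txt
    System.out.println(describe(URI.create("urn:isbn:9780486275437")));
    // urn | isbn:9780486275437
  }
}
```

If the URI does not state a port explicitly, URI.getPort() returns -1 rather than the default port of the scheme.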
Java provides the class java.net.URI for representing a URI. The class comes with many ways to create an instance of a URI, including constructors and static factory methods. The simplest way to create a URI from a string you already know to contain a valid URI is to call URI.create(String str), which throws an IllegalArgumentException if the string is not in valid URI format. You can also resolve a path string against a URI instance, similar to resolving a path string against a Path instance for the file system, using URI.resolve(String str).
final URI exampleBaseURI = URI.create("http://www.example.com/");
final URI indexURI = exampleBaseURI.resolve("index.html"); //yields http://www.example.com/index.html
URI Encoding
The characters that may be stored in a URI are limited to a subset of ASCII characters, and some of them have special meaning (e.g. a slash / character is used to separate path segments). Characters that are outside the ASCII range or that are restricted must be encoded if they are to be included in a URI. RFC 3986 specifies that characters may be encoded in a URI by using the following algorithm:
- Encode the character in UTF-8.
- Encode each resulting byte as two uppercase hexadecimal digits preceded by a percent % sign (e.g. %0F).
Thus the word touché could be encoded in a URI relative to http://example.com/ as such:
http://example.com/touch%C3%A9
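The encoding of touché can be checked with java.net.URI itself: URI.toASCIIString() percent-encodes any non-ASCII characters as UTF-8 bytes. The following is a small sketch; the class and method names are invented for illustration, and the é is written as the escape \u00e9 so that the source file encoding does not matter.

```java
import java.net.URI;

/** Sketch: percent-encode a non-ASCII path segment by way of URI.toASCIIString(). */
final class UriEncodingDemo {

  /** Resolves a relative reference against a base URI and returns the all-ASCII form. */
  static String encodeRelative(final String base, final String relative) {
    return URI.create(base).resolve(relative).toASCIIString();
  }

  public static void main(final String[] args) {
    // U+00E9 (é) is the two bytes C3 A9 in UTF-8, hence %C3%A9.
    System.out.println(encodeRelative("http://example.com/", "touch\u00e9"));
    // http://example.com/touch%C3%A9
  }
}
```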
URL
By far the most common type of URI is the URL. Here are some URL schemes you are likely to run into:
Scheme | Description | Default Port | Example |
---|---|---|---|
file | File on a file system. The file scheme has many inconsistencies, especially on Windows when used from Java. | N/A | file:///usr/local/foo/bar.txt |
ftp | Resource accessible via FTP. | 21 | ftp://ftp.example.com/foo/bar.txt |
http | Resource accessible via HTTP. | 80 | http://www.example.com/foo/bar.txt |
https | Resource accessible via HTTP over SSL/TLS. | 443 | https://www.example.com/foo/bar.txt |
mailto | Email address | varies | mailto:jdoe@example.com |
tel | Telephone number. | N/A | tel:+1-415-555-0123 |
The java.net.URL class predates the java.net.URI class and represents not only a URL endpoint, but also several implementations of how to retrieve data from the URL.
Media Types
The data you can transfer across the Internet comes in various flavors. An HTTP GET request could retrieve a plain text file, an HTML file, an audio clip, or an image. Rather than relying on filename extension, Internet architecture uses a more robust mechanism called a media type for determining the type of an entity. Most recently specified in RFC 6838, media types have a formal IETF registration process and provide fixed, descriptive identifiers for media content.
A media type has a specially formatted identifier consisting of a type and subtype. Some common media types include:
text/plain
text/html
application/json
application/xml
image/jpeg
image/png
audio/mpeg
A media type value may carry additional parameters as a suffix, separated from the type and subtype by a semicolon ; character. The most common parameter is charset, which names the character encoding of the content; the parameter name and its value are joined by an equals = character. The following media type would indicate a type of plain text with a UTF-8 charset:
text/plain; charset=UTF-8
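To illustrate how such a value breaks down, here is a deliberately simple, hand-written sketch that separates the type/subtype from a charset parameter. The class MediaTypeDemo and its methods are hypothetical, not part of any library, and the parsing ignores details such as quoted parameter values that a complete implementation would have to handle.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: split a media type value into its type/subtype and its charset parameter. */
final class MediaTypeDemo {

  /** Returns the "type/subtype" portion that precedes any parameters. */
  static String baseType(final String mediaType) {
    final int semicolon = mediaType.indexOf(';');
    return (semicolon < 0 ? mediaType : mediaType.substring(0, semicolon)).trim();
  }

  /** Returns the charset named by the charset parameter, or the fallback if there is none. */
  static Charset charset(final String mediaType, final Charset fallback) {
    final String[] parts = mediaType.split(";");
    for (int i = 1; i < parts.length; i++) { // parts[0] is the type/subtype
      final String[] nameValue = parts[i].trim().split("=", 2);
      if (nameValue.length == 2 && nameValue[0].trim().equalsIgnoreCase("charset")) {
        return Charset.forName(nameValue[1].trim());
      }
    }
    return fallback;
  }

  public static void main(final String[] args) {
    final String mediaType = "text/plain; charset=UTF-8";
    System.out.println(baseType(mediaType)); // text/plain
    System.out.println(charset(mediaType, StandardCharsets.US_ASCII)); // UTF-8
  }
}
```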
HTTP
The Hypertext Transfer Protocol (HTTP) is an application-layer protocol in the TCP/IP stack. It is the basis of browsing the World Wide Web, as well as the foundation for many new Internet-based APIs. HTTP has almost become synonymous with the Internet.
Communication with HTTP is conceptually very simple, which is one reason for its ubiquity. An HTTP command consists of a request and a response. Each HTTP request includes a method (or verb) indicating an action to perform. The response indicates the result of the request.
- Request: verb resource-URI [content] (Do something with the identified resource using the given content, if any.)
- Response: response-code [content] (Here is the outcome of the request, with content if appropriate.)
The most commonly used HTTP method is GET, which indicates that the user agent (such as a web browser) wants to retrieve a resource (such as a web page). The response code 200 indicates that everything went OK; the requested web page will be returned in the body of the response. At the beginning of each request and response is a series of headers (names and values, separated by a colon : character), which provide more information about the message. The conversation may look like this when retrieving the http://www.example.com/index.html home page:
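As a rough illustration of such an exchange, the sketch below builds the text of a minimal HTTP/1.1 GET request and picks the response code out of a status line. It follows the general message layout of RFC 7230 (a start line, headers, then a blank line), but the class HttpMessageSketch, the chosen headers, and the sample response are assumptions made for this example, not a capture of a real conversation.

```java
/** Sketch: the textual shape of a simple HTTP/1.1 request and response. */
final class HttpMessageSketch {

  /** Builds a minimal GET request: request line, headers, then a blank line. */
  static String getRequest(final String host, final String path) {
    return "GET " + path + " HTTP/1.1\r\n"
        + "Host: " + host + "\r\n"
        + "Accept: text/html\r\n"
        + "\r\n";
  }

  /** Extracts the numeric response code from a status line such as "HTTP/1.1 200 OK". */
  static int responseCode(final String statusLine) {
    return Integer.parseInt(statusLine.split(" ", 3)[1]);
  }

  public static void main(final String[] args) {
    System.out.print(getRequest("www.example.com", "/index.html"));
    // A made-up response: status line, headers, blank line, then the body.
    final String response = "HTTP/1.1 200 OK\r\n"
        + "Content-Type: text/html; charset=UTF-8\r\n"
        + "\r\n"
        + "<html>...</html>";
    final String statusLine = response.split("\r\n", 2)[0];
    System.out.println(responseCode(statusLine)); // 200
  }
}
```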
URLConnection
Java provides an abstract class java.net.URLConnection, which allows reading from and writing to a URL location. Most important in the current context is the abstract subclass java.net.HttpURLConnection, which provides special access to HTTP-specific information. You can specify the HTTP method to use with the HttpURLConnection.setRequestMethod(String method) method, and then ask for the response code sent back in the response by calling HttpURLConnection.getResponseCode().
You can get an instance of URLConnection by calling URL.openConnection(). If the URL you provided has one of the HTTP schemes, you can be sure that the returned value will be an HttpURLConnection:
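A minimal sketch of that sequence follows. The class name HttpConnectionDemo and the helper prepareGet() are invented for illustration; the URI is the example address used earlier in this lesson, and actually running main requires network access. Note that opening the connection object does not yet contact the server; that happens when the response code is requested.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

/** Sketch: obtain an HttpURLConnection from a URI and check the response code. */
final class HttpConnectionDemo {

  /** Prepares, but does not yet send, a GET request for the given http or https URI. */
  static HttpURLConnection prepareGet(final URI uri) {
    try {
      final URLConnection connection = uri.toURL().openConnection(); // no network traffic yet
      final HttpURLConnection http = (HttpURLConnection) connection; // holds for HTTP schemes
      http.setRequestMethod("GET");
      return http;
    } catch (final IOException e) { // includes MalformedURLException and ProtocolException
      throw new UncheckedIOException(e);
    }
  }

  public static void main(final String[] args) throws IOException {
    final HttpURLConnection http = prepareGet(URI.create("http://www.example.com/index.html"));
    final int responseCode = http.getResponseCode(); // connects and sends the request
    if (responseCode == HttpURLConnection.HTTP_OK) {
      System.out.println("Found the resource.");
    } else if (responseCode == HttpURLConnection.HTTP_NOT_FOUND) {
      System.out.println("No such resource.");
    } else {
      System.out.println("Unexpected response code: " + responseCode);
    }
    http.disconnect();
  }
}
```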
Review
Gotchas
- TODO talk about absolute/relative in relation to URIs
- TODO talk about URI constructors and encoded/raw
- TODO talk about URI.toString()
- TODO talk about File URIs
- Do not use URL encoding (which uses the plus + sign to encode spaces) when you are encoding non-ASCII characters in URIs. Only use URL encoding for HTML form submission data.
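The difference mentioned in the last point can be seen by encoding the same text both ways. This is a sketch with made-up class and method names; it uses java.net.URLEncoder for form-style encoding and a multi-argument java.net.URI constructor, which quotes illegal characters such as spaces, for URI encoding. It assumes Java 10 or later for the URLEncoder.encode(String, Charset) overload.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: form-style URL encoding versus percent-encoding of a URI path. */
final class SpaceEncodingDemo {

  /** Encodes a value as for HTML form submission; spaces become plus signs. */
  static String formEncode(final String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  /** Builds an http URI from a raw path; spaces become %20 and non-ASCII becomes UTF-8 escapes. */
  static String uriEncodePath(final String host, final String rawPath) {
    try {
      return new URI("http", host, rawPath, null).toASCIIString();
    } catch (final URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(final String[] args) {
    System.out.println(formEncode("caf\u00e9 au lait")); // caf%C3%A9+au+lait
    System.out.println(uriEncodePath("example.com", "/caf\u00e9 au lait")); // http://example.com/caf%C3%A9%20au%20lait
  }
}
```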
In the Real World
- Almost every URI you will encounter will be a URL. You will seldom see an actual URN, but you should know that they exist.
Think About It
- TODO
Self Evaluation
- What is the OSI model and how does it relate to TCP/IP?
- What is the difference between a URI and a URL?
- What does it mean that HTTP is stateless?
- Is there such a thing as a relative URI?
- Is it possible for an HTTP URI to have a relative path?
Task
Add a new info command-line option to the Booker program to provide information on a particular book. Add a new --lookup flag to the Booker program, to be used in conjunction with the list command, indicating that information should be retrieved from the Internet. If present, retrieve and print information from the Google Books API.
- Use the provided ISBN to retrieve information from the Google Books API. Print out whatever results are returned.
- Store the Google Books API v1 base URI in a separate constant, and use URI.resolve(…) to produce the full query URI (perhaps storing it in a separate constant).
- Use a URLConnection with the HTTP GET method.
- If the response code is 200 (OK), assume the response content is text encoded in UTF-8. Otherwise, present an error.
Example usage: booker list --isbn 9780486275437 --lookup
booker list [--debug] [--locale <locale>] [--isbn <ISBN> | --issn <ISSN>] [--name <name>] [--type (book|periodical)] [--lookup]
booker load-snapshot [--debug] [--locale <locale>]
booker purchase --isbn <ISBN> [--debug] [--locale <locale>]
booker subscribe --issn <ISSN> [--debug] [--locale <locale>]
booker -h | --help
Option | Alias | Description |
---|---|---|
list | | Lists all available publications. |
load-snapshot | | Loads the snapshot list of publications into the current repository. |
purchase | | Removes a single copy of the book identified by ISBN from stock. |
subscribe | | Subscribes to a year's worth of issues of the periodical identified by ISSN. |
--debug | -d | Includes debug information in the logs. |
--help | -h | Prints out a help summary of available switches. |
--isbn | | Identifies a book, for example for the purchase command. |
--issn | | Identifies a periodical, for example for the subscribe command. |
--locale | -l | Indicates the locale to use in the program, overriding the system default. The value is in language tag format. |
--lookup | | Retrieves from the Internet information on a book identified by its ISBN. |
--name | -n | Indicates a filter by name for the list command. |
--type | -t | Indicates the type of publication to list, either book or periodical. If not present, all publications will be listed. |
See Also
- TCP/IP Protocol Fundamentals Explained with a Diagram (Himanshu Arora - The Geek Stuff)
- HTTP: The Protocol Every Web Developer Must Know - Part 1 (Pavan Podila)
- Internet protocol suite (Wikipedia)
- Hypertext Transfer Protocol (Wikipedia)
- Reading from and Writing to a URLConnection (Oracle - The Java™ Tutorials)
- Google Books APIs Overview (Google Developers)
References
- Google Books API v1: Getting Started (Google Developers)
- RFC 1122: Requirements for Internet Hosts -- Communication Layers
- RFC 3849: IPv6 Address Prefix Reserved for Documentation
- RFC 3986: Uniform Resource Identifier (URI): Generic Syntax
- RFC 3987: Internationalized Resource Identifiers (IRIs)
- RFC 5737: IPv4 Address Blocks Reserved for Documentation
- RFC 6838: Media Type Specifications and Registration Procedures
- RFC 7230: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing
- RFC 7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content
Resources
- Google Books APIs (Google Developers)
- RFC Search (IETF)
- Internet Engineering Task Force (IETF)
- International Organization for Standardization (ISO)
Acknowledgments
- URI Euler Diagram by David Torres (original author); derivative work: Qwerty0 (URI_VENN_DIAGRAM.SVG) [CC BY-SA 3.0 or GFDL], via Wikimedia Commons.
- URI syntax diagrams from Wikipedia, under the Creative Commons Attribution-ShareAlike License.
- DNS lookup diagram from blog post by Liviu Tudor on the web site of Dyn, a company that provides DNS services. Used with permission from EMEA Marketing Coordinator on 2015-11-12.