SURF
Goals
- Learn the history of text-based serialization formats.
- Understand the components of a SURF document.
- Gain the ability to create SURF documents, both from scratch and from existing JSON files.
Concepts
- alias
- binary format
- collection
- second-level domain
- decimal
- Extensible Markup Language (XML)
- handle
- ID
- instant
- Internationalized Resource Identifier (IRI)
- JavaScript Object Notation (JSON)
- label
- markup tag
- name
- informal namespace
- namespace prefix
- property
- resource
- Resource Description Framework (RDF)
- semantics
- semantic framework
- serialize
- Simple URF (SURF)
- Standard Generalized Markup Language (SGML)
- tag
- terminator
- triple
- Uniform Resource Framework (URF)
- URF processor
- Universally Unique Identifier (UUID)
- verbose
- World Wide Web Consortium (W3C)
Library
Preview
Lesson
The primary purpose of a computer program is to process data, usually in the form of objects. This data processing usually occurs in the computer's memory, but at some point the program needs to serialize the objects, turning them into a series of bytes, so that they can be stored in a file or transferred to another system. Serialization is used for everything from configuration files to instant message transport. The Simple URF (SURF) format strives to be a serialization format for the reactive Internet era, striking a balance between simplicity (being almost as simple as JSON and slightly more compact) and expressiveness (having more types than JSON, extensibility through vocabularies, and compatibility with semantic frameworks).
History
Binary Serializations
Early serialization approaches used binary formats and were not easily read by humans, as they used arbitrary numbers as delimiters and represented data in its binary form as stored in memory.
- Good
  - Binary formats are very compact.
- Bad
  - Binary formats are hard to debug, as they are not readily human-readable.
  - Many binary formats are proprietary, impeding interoperability.
Markup Languages
Early markup languages such as the Standard Generalized Markup Language (SGML) from the 1980s weren't originally meant to store objects but to add annotations to text data. Markup tag pairs such as “<para>…</para>” (indicating a paragraph) were inserted around text to allow it to be styled for publishing. In the early 1990s Tim Berners-Lee invented the HyperText Markup Language (HTML), an implementation of SGML that provided simple markup for web pages using tag pairs such as “<p>…</p>”. (The latest version, HTML5, is the current best-practices format for creating web content.)
HTML was not meant for serializing general data such as objects, and SGML was too complex, so in the later 1990s the World Wide Web Consortium (W3C) created the Extensible Markup Language (XML), a simplified version of SGML. XML provided a simple way to define the structure of any data that could be stored in text form, yet did not define the actual tag names used or their semantics (that is, how they were to be interpreted). A typical XML document might appear like the one below:
- Good
  - XML is human-readable.
  - XML is standardized and ubiquitous.
  - XML works well for annotating text data.
  - XML provides a mechanism for mixing vocabularies.
- Bad
  - XML is verbose, sometimes tripling the size of the data being encoded.
  - XML arbitrarily distinguishes two types of data: “attributes” (e.g. the authenticated attribute) and “child elements” (e.g. the <name> tag).
  - XML itself has no way to specify the types of data (e.g. that true is a Boolean value, jdoe is a string, 123 is a number, and 2016-01-23 is a date).
  - XML retains some complicated features that are seldom used yet that all parsers must support (e.g. DTDs, character references, CDATA sections).
  - XML is only a syntax and has no means for referring to other things as in a general graph.
JSON
For almost two decades XML has been the default serialization format for many kinds of data, but it is now being supplanted by newer, simpler formats. Web developers in particular desired a format that was smaller for transferring between the browser and a server. In the early 2000s the JavaScript Object Notation (JSON) format was formalized, using a subset of the syntax JavaScript used for declaring objects. As JavaScript was the primary language used in the browser, JSON could be parsed directly by evaluation as if it were a JavaScript program.
The central serialized type in JSON is an “object”, a comma-separated mapping of string keys to values inside brace { and } characters, such as {"foo" : "bar"}. A value can be a string; a number; a comma-separated array inside bracket [ and ] characters; a Boolean value true or false; the value null; or another object. The same information above could be stored in JSON as follows:
- Good
  - JSON is more compact than XML.
  - JSON has no arbitrary distinction of child information as XML has with its “attributes”.
  - JSON is more than a delimiter syntax; it encodes values.
  - JSON is standardized and becoming ubiquitous.
  - The JSON specification is smaller, easier to understand, and quicker to implement.
- Bad
  - JSON is still somewhat verbose, requiring quotation " characters for object keys and comma , characters between key-value pairs.
  - JSON has a very limited set of types; single characters, dates, URLs and the like must be represented by strings and parsed separately later by the consumer.
  - JSON has no inherent way for objects to refer to other objects as in a graph.
  - JSON has no mechanism for mixing vocabularies defined by independent parties.
  - JSON does not support document-level metadata, and otherwise reflects its creation as Java
In short, JSON was not created from scratch to be a generic serialization format, and it shows. JSON was a convenient extraction of part of the JavaScript language which has proven to be very useful as an alternative to XML.
Semantic Frameworks
The W3C has since the 1990s been formulating the Resource Description Framework (RDF), a semantic framework for describing resources (objects or even abstract ideas) and the meaning of relationships between them. These relationships or properties are defined in a very formal sense, allowing reasoning via propositional logic. (For example, if Bob is the manager of Jane, and Jill is the manager of Bob, then Jill is the manager of Jane because the “manager” property is transitive.) In RDF, even the properties themselves are resources and can be described with their own properties.
RDF is for the most part a semantic framework independent of any serialization. Two popular serializations are RDF/XML, which uses special XML tags, and Turtle, a syntax that uses triples of subject, property, and object to represent semantic propositions.
- Good
  - RDF allows objects
  - RDF provides semantic representations among objects.
  - RDF allows a reasonable set of types, leveraging XML Schema Datatypes (even in the Turtle syntax).
  - RDF allows mixing of vocabularies.
  - RDF allows references for graphs of objects.
- Bad
  - Neither RDF/XML nor Turtle can be used without understanding the RDF data model specification, which is dense, complicated, and academic.
  - RDF provides several redundant representations of even simple things such as strings.
  - The RDF/XML serialization is extremely verbose.
  - The Turtle serialization is verbose, unwieldy, and opaque.
URF, TURF, and SURF
In 2007, in an effort to create a simpler, more elegant, and more consistent semantic framework, Garret Wilson and GlobalMentor, Inc. started work on the Uniform Resource Framework (URF). The TURF syntax for URF is meant to provide a comprehensive, text-based format for data archival, as well as to satisfy the academic requirements of a rigorous semantic framework. The goal for SURF, however, is to provide a simple serialization format that is as easy to use as JSON, while still maintaining URF semantics and compatibility with TURF.
SURF does not require any knowledge of semantic frameworks in using the syntax. In other words, SURF is an improved JSON that brings more types and more features while hardly increasing the complexity—and brings the rigor of a semantic framework for free.
- Good
  - SURF is almost as simple as JSON.
  - SURF is small; a SURF document can be more compact than JSON.
  - SURF can be more readable than JSON, dispensing with unnecessary syntax such as commas between items in multi-line lists.
  - SURF provides many more types than JSON.
  - SURF allows mixing of vocabularies, yet without the need to think of bulky URIs.
  - SURF allows references for graphs of objects.
- Bad
  - No one has adopted SURF yet.
SURF
In addition to the benefits listed above, SURF has several useful characteristics:
- All JSON documents are also valid SURF documents.
- All SURF documents are also valid TURF documents.
JSON Compatibility
Any JSON document can be parsed as a valid SURF document! But SURF can be more expressive; compare the SURF and JSON documents presented earlier in this lesson. The additional features will be explained further in the following sections.
Comments
A comment may be added using the exclamation ! mark. The comment will continue until the end of the line.
Handles
SURF has the normal restrictions on identifiers as with many other formats and languages, except that SURF is fully Unicode aware. A SURF name must begin with a letter and follow with any number of letters, digits, and connectors such as the underscore _ character. Examples include Vehicle, color, and foo_bar.
SURF provides a way to prevent clashes by placing names in informal namespaces. A name may begin with one or more prefixes, delimited by the hyphen - character. These prefixes indicate a hierarchy of informal namespaces. Various parties may supply defined vocabularies which may be freely mixed in a SURF document. In such a case the namespaces indicated by the prefixes will prevent the identifiers from overlapping, even if they would otherwise have the same name. A name along with its prefix(es) is referred to as a handle.
For example, the name “salt” could denote the additional cryptographic input to a password hash function (see Salt (cryptography)), or it could identify an ionic chemical compound (see Salt (chemistry)). SURF allows these two names to be distinguished by the addition of a namespace prefix. One might use the “crypto-” prefix, while another might use the “chem-” prefix, producing the handles crypto-salt and chem-salt, respectively.
| | SURF | JSON |
---|---|---|
Handles | | N/A |
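The handle rules above can be sketched in Python. This is only an approximation: it assumes that Python's Unicode-aware `\w` class is close enough to what SURF means by letters, digits, and connectors, and the function name `split_handle` is hypothetical rather than part of any SURF library.

```python
import re

# Approximation of a SURF name: a letter, then letters, digits, or "_".
# [^\W\d_] matches roughly any Unicode letter; \w adds digits and underscore.
NAME = r"[^\W\d_]\w*"
# A handle is zero or more hyphen-terminated prefixes followed by a name.
HANDLE = re.compile(rf"(?:{NAME}-)*{NAME}")


def split_handle(handle):
    """Return (namespace prefixes, name) for a handle, or raise ValueError."""
    if not HANDLE.fullmatch(handle):
        raise ValueError(f"not a valid handle: {handle!r}")
    *prefixes, name = handle.split("-")
    return prefixes, name


print(split_handle("crypto-salt"))  # (['crypto'], 'salt')
print(split_handle("foo_bar"))      # ([], 'foo_bar')
```

The two salt handles from the text split into different prefixes but the same name, which is how the clash is avoided.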
Resource Descriptions
SURF is all about describing “resources”, which (in URF as in RDF) is anything that can be identified for discussion—that is, anything you can talk about. There are three types of resource representations in SURF:
- object
- A description of some resource, such as a web site.
- literal
- A representation of a resource that can be identified by a lexical form, such as a number or a string.
- collection
- A resource that aggregates other resources.
Objects
SURF objects are analogous to the objects described by JSON. An anonymous object in SURF is denoted by the asterisk * character, while in JSON it is represented by opening and closing brace {} characters.
| | SURF | JSON |
---|---|---|
Object | * | {} |
A bare object is somewhat boring, so both SURF and JSON allow them to be described. SURF uses the colon : and semicolon ; characters as delimiters for an object description block. Each element of a SURF object description consists of a property, which is a SURF identifier, and its associated value, which is any SURF resource. The property and its value are separated using the equals = sign, unlike JSON which uses a colon : character.
| | SURF | JSON |
---|---|---|
Resource Description | | |
You may also optionally specify your own custom type for any object by indicating the type name after the asterisk, as long as the type name follows the rules described for Handles, above. Examples include *User and *chem-Molecule.
| | SURF | JSON |
---|---|---|
Custom Type | | N/A |
Literals
Strings
Strings appear the same in SURF and in JSON.
| | SURF | JSON |
---|---|---|
Strings | | |
Numbers
Numbers are represented the same in SURF and in JSON. But SURF comes with two number types that JSON doesn't have. One is an integer type: SURF allows you to indicate that a numerical value is restricted to whole numbers by leaving off the fractional part and exponent. While JSON considers both 5 and 5.0 to be general “numbers”, SURF considers the former to be specifically an integer type.
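Python's standard `json` module happens to draw a similar line when it parses number tokens, which makes it a convenient way to see the idea. The sketch below shows Python's behavior only, not that of a SURF parser; the helper name `number_kind` is made up for illustration.

```python
import json


def number_kind(text):
    """Name the Python type that the json module picks for a number token."""
    return type(json.loads(text)).__name__


# A token with no fractional part or exponent becomes an int;
# anything else becomes a float.
for token in ["5", "5.0", "5e0"]:
    print(token, "->", number_kind(token))
```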
It is sometimes forgotten that IEEE 754 floating point numbers store only approximations of some numbers, 0.3 being one example. Most libraries use floating point for parsing and serializing in JSON, and SURF allows this as well. But because fractional approximations are not desired when dealing with some things such as money, SURF provides a decimal type which guarantees that the fractional part will be represented exactly. Decimals are indicated by the dollar sign $ prefix. This does not mean that this is a currency type, although the decimal type is often used to represent money.
| | SURF | JSON |
---|---|---|
Numbers | | |
Integers | | N/A |
Decimals | | N/A |
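The difference between floating point and an exact decimal type can be seen with Python's `decimal` module. This is a sketch of the general idea; the token form `$0.10` and the helper `parse_surf_decimal` are assumptions based only on the "dollar sign prefix" description above, not taken from the SURF specification.

```python
from decimal import Decimal

# Binary floating point holds only approximations of 0.1, 0.2, and 0.3.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004


def parse_surf_decimal(token):
    """Parse a hypothetical '$'-prefixed decimal token into an exact Decimal."""
    if not token.startswith("$"):
        raise ValueError("decimal literals start with '$'")
    # Building the Decimal from text keeps every digit exactly as written.
    return Decimal(token[1:])


total = parse_surf_decimal("$0.10") + parse_surf_decimal("$0.20")
print(total == Decimal("0.30"))  # True
```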
Boolean
Both SURF and JSON have the same Boolean values, true and false.
| | SURF | JSON |
---|---|---|
Boolean | | |
Dates and Times
SURF comes with extensive date and time handling, addressing one of the most glaring deficiencies of JSON. SURF supports the most common representations specified by ISO 8601 for absolute instants in time. Moreover, SURF supports local date+time specifications, durations, and other temporal concepts present in modern date/time libraries.
| | SURF | JSON |
---|---|---|
Instant | @2017-02-12T23:29:18.829Z | N/A |
ZonedDateTime | @2017-02-12T15:29:18.829-08:00[America/Los_Angeles] | N/A |
OffsetDateTime | @2017-02-12T15:29:18.829-08:00 | N/A |
OffsetDate | @2017-02-12-08:00 | N/A |
OffsetTime | @15:29:18.829-08:00 | N/A |
LocalDateTime | @2017-02-12T15:29:18.829 | N/A |
LocalDate | @2017-02-12 | N/A |
LocalTime | @15:29:18.829 | N/A |
YearMonth | @2017-02 | N/A |
MonthDay | @--02-12 | N/A |
Year | @2017 | N/A |
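Because the forms in the table follow ISO 8601, several of them can be read with a standard library once the @ prefix is removed. The sketch below handles only the OffsetDateTime, LocalDateTime, and LocalDate rows; `parse_temporal` is a hypothetical helper, not a SURF parser, and the trailing Z form is left out because older Python versions do not accept it.

```python
from datetime import date, datetime, timezone


def parse_temporal(token):
    """Parse a few of the '@'-prefixed forms from the table above."""
    if not token.startswith("@"):
        raise ValueError("temporal literals start with '@'")
    text = token[1:]
    if "T" in text:
        # LocalDateTime (no offset) or OffsetDateTime (with offset).
        return datetime.fromisoformat(text)
    return date.fromisoformat(text)  # LocalDate


offset = parse_temporal("@2017-02-12T15:29:18.829-08:00")
# Converting to UTC yields the same moment as the Instant row in the table.
print(offset.astimezone(timezone.utc).isoformat())
```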
Regular Expressions
SURF supports regular expressions as a first-class type along with strings, dates, etc. They are surrounded by the slash / delimiter character, as they are in JavaScript, which makes it somewhat curious that JSON does not support them.
| | SURF | JSON |
---|---|---|
Regular Expression | /a?b+c*/ | N/A |
IRIs
SURF can represent an Internationalized Resource Identifier (IRI) (RFC 3987) as its own type rather than as a string. A URL used to identify web addresses is the most well-known form of IRI. An IRI in SURF is surrounded by less than < and greater than > signs.
| | SURF | JSON |
---|---|---|
IRI | <http://example.com/> | N/A |
UUIDs
SURF can also represent a different type of identifier known as a Universally Unique Identifier (UUID) (RFC 4122). This identifier is a 128-bit value that can be generated on separate systems according to a certain algorithm, yet with an extremely small chance of two UUIDs being identical across all systems. As with IRIs, SURF represents a UUID as a separate type, not as a string.
The SURF UUID representation begins with the ampersand & character (think of the reference operator in C/C++) followed by the canonical UUID representation of groups of hexadecimal octets in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. That is, 32 hexadecimal digits appear as a total of 36 characters in the form 8-4-4-4-12.
| | SURF | JSON |
---|---|---|
UUID | &5623962b-22b1-4680-ae1c-7174a46144fc | N/A |
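The canonical form described above is what Python's `uuid` module produces, so a small sketch can confirm the 8-4-4-4-12 layout. The helper names `to_surf_uuid` and `parse_surf_uuid` are hypothetical and only add or remove the & prefix.

```python
import uuid


def to_surf_uuid(value):
    """Format a uuid.UUID in the '&'-prefixed form described above."""
    return "&" + str(value)


def parse_surf_uuid(token):
    """Read an '&'-prefixed UUID token back into a uuid.UUID."""
    if not token.startswith("&"):
        raise ValueError("UUID literals start with '&'")
    return uuid.UUID(token[1:])


example = parse_surf_uuid("&5623962b-22b1-4680-ae1c-7174a46144fc")
canonical = str(example)
print(len(canonical))                                  # 36
print([len(group) for group in canonical.split("-")])  # [8, 4, 4, 4, 12]
print(to_surf_uuid(uuid.uuid4()))                      # a new random UUID
```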
Email Addresses
A common way of uniquely identifying people is by email address. SURF comes with a special type to represent emails as opposed to general strings. SURF emails follow the address specification in RFC 5322, with the addition of the circumflex accent or “caret” ^ character at the beginning. (Think of the paper airplane symbol representing “send email” in many user interfaces.)
| | SURF | JSON |
---|---|---|
Email Address | ^jdoe@example.com | N/A |
Telephone Numbers
Another popular identifier is the telephone number, and SURF has a literal representation for those as well. A SURF telephone number follows the syntax for “global numbers” (i.e. those that begin with the plus + sign) described in RFC 3966. The visual separators allowed by RFC 3966 are optional. SURF does not allow telephone number parameters.
| | SURF | JSON |
---|---|---|
Telephone Number | +12015550123 | N/A |
Binary Values
SURF allows you to represent a series of bytes. The percent % character is used as the beginning delimiter (think of the percent symbol as representing 1s and 0s), with the following binary data encoded as Base64 (RFC 4648) using the “base64url” alphabet. SURF does not allow Base64 padding.
| | SURF | JSON |
---|---|---|
Binary | %Zm9vYmFy | N/A |
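The encoding described above (base64url alphabet, no padding, % prefix) can be sketched with Python's `base64` module. The helper names are hypothetical; the padding is stripped on output and restored on input because Python's decoder expects it.

```python
import base64


def to_surf_binary(data):
    """Encode bytes as '%' plus unpadded base64url text."""
    encoded = base64.urlsafe_b64encode(data).decode("ascii")
    return "%" + encoded.rstrip("=")


def from_surf_binary(token):
    """Decode a '%'-prefixed, unpadded base64url token back into bytes."""
    if not token.startswith("%"):
        raise ValueError("binary literals start with '%'")
    text = token[1:]
    # Restore the padding that the unpadded form leaves out.
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


print(to_surf_binary(b"foobar"))  # %Zm9vYmFy
print(to_surf_binary(b"fo"))      # %Zm8
print(from_surf_binary("%Zm8"))   # b'fo'
```

The first line reproduces the value in the table, which is the Base64 encoding of the ASCII text "foobar".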
Collections
Lists
Lists in SURF are presented like arrays in JSON using the bracket [ and ] characters. An added benefit over JSON is that you do not need the comma , separator if you place the list items on separate lines.
| | SURF | JSON |
---|---|---|
List | | |
Sets
Sets are unordered collections that do not allow duplicates. They are formatted the same as lists, except that they use parentheses ( and ) characters.
| | SURF | JSON |
---|---|---|
Set | | N/A |
Maps
Maps are associations of values with unique key values. SURF represents map key+value entries inside brace { and } characters, with each key and value separated by a colon : character. Unlike the keys of JSON associative arrays, SURF map keys do not have to be strings. The comma , separator is not needed if the map entries appear on separate lines.
| | SURF | JSON |
---|---|---|
Map | | N/A |
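The restriction to string keys is easy to observe from the JSON side. The sketch below uses Python's `json` module and a dictionary keyed by dates as a stand-in for a map with non-string keys; the data and the helper `json_round_trip` are invented for illustration.

```python
import json
from datetime import date


def json_round_trip(value):
    """Serialize a value to JSON text and parse it back."""
    return json.loads(json.dumps(value))


# A Python dict, like a SURF map, may use keys that are not strings.
holidays = {date(2017, 1, 1): "New Year's Day", date(2017, 7, 4): "Independence Day"}

# Integer keys survive a JSON round trip only as strings.
print(json_round_trip({1: "one"}))  # {'1': 'one'}

# Other key types, such as dates, are rejected by the json module.
try:
    json_round_trip(holidays)
except TypeError as error:
    print("cannot serialize:", error)
```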
Labels
In most programming languages, an in-memory graph of objects may have several references to the same instance. Many serialization formats, such as JSON, do not have built-in mechanisms for referencing nodes serialized elsewhere in the document. Other formats such as XML require special processors that understand additional specifications such as XML Schema, or rely on proprietary application-level interpretations completely outside the format specifications. SURF allows you to reference any resource within a document by using a label, which consists of an identifier inside vertical line | characters and placed in front of a resource. SURF supports three types of labels, which differ in their resolving power, that is, how broadly the label will identify a unique resource.
Aliases
At the finest granularity, you can create an alias by placing any SURF name inside the label delimiters, such as |foo|. An alias can be given to any resource. After assigning an alias to a resource, the alias can then be used anywhere in the SURF document that the original resource could have been used. The SURF parser will recognize all occurrences of an alias as referring to the same resource instance. An alias only exists as syntax within a SURF document; an alias itself is not present in the object graph a SURF parser returns.
In a typical web authentication scenario, several web pages may indicate that only a user with the “administrator” role may have access. Without a built-in mechanism for referencing resources, the usual workaround is to create an extra identifier field and require the processing application to understand that these references should create links, as in the following example:
Adding SURF aliases moves the linking semantics into the document itself, relieving the application from the need to manually make connections based upon some proprietary scheme. In the example below, a SURF parser will automatically create references to the Role resources; no extra work is required on the part of the application. This example removes the role id property to indicate it is no longer needed for referencing within the SURF document.
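To make the workaround concrete, here is a sketch of the linking step an application has to perform by hand when a document refers to roles by identifier. The data and the field names `roles`, `pages`, `id`, and `roleId` are all hypothetical; with aliases, a SURF parser would be expected to hand back the shared references directly.

```python
# Hypothetical data shaped like a parsed document that links roles by ID.
document = {
    "roles": [{"id": "admin", "name": "Administrator"}],
    "pages": [
        {"url": "http://example.com/settings", "roleId": "admin"},
        {"url": "http://example.com/users", "roleId": "admin"},
    ],
}


def link_roles(doc):
    """Replace each page's roleId with a reference to the role it names."""
    roles_by_id = {role["id"]: role for role in doc["roles"]}
    for page in doc["pages"]:
        page["role"] = roles_by_id[page.pop("roleId")]
    return doc


linked = link_roles(document)
first, second = linked["pages"]
# Both pages now point at one shared role instance.
print(first["role"] is second["role"])  # True
```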
Tags
If you are describing a resource such as a web site (or in the semantic world, even people or abstract ideas) that is already identified by an IRI, you can assign a tag to the resource. A tag is the identifying IRI of the object and is placed inside the label along with IRI delimiters. For example the tag |<http://example.com/>| is used to identify the web site <http://example.com/>. A tag becomes the resource's official global identifier, and will continue to exist outside the SURF document.
You could improve the SURF document above by identifying the two page resources by their IRIs. After parsing the document an application can ask each of the page resources for its identifier. Moreover you could reference resources within the SURF document by using their tags, just as you can do with aliases. This example removes the page url property to indicate it is no longer needed, as the page objects themselves are identified by IRIs.
IDs
SURF provides one more kind of reference identifier called an ID. Like a tag, an ID is assigned to the object and exists outside the SURF document. Unlike a tag, which provides a globally unique identification, an ID uniquely identifies a resource only among the resources of the indicated type. For this reason if you indicate an ID for a resource, you must also indicate a resource type. You can specify an ID by providing a string such as "user" (note the surrounding double quote " characters) as the label identifier.
URF
On its own the SURF format provides a concise, elegant, and flexible syntax for data storage. Yet unlike other formats that only address syntax, SURF is built on a rigorous semantic model named the Uniform Resource Framework (URF). No understanding of semantic frameworks is needed to use SURF; the information presented here merely provides a taste of the consistency URF provides, along with some of the ways URF makes it possible to access and process data.
While providing a better data storage format, SURF sneaks in a rigorous data model that is as capable as RDF yet simpler and more consistent. Although a SURF parser that is not aware of URF can still extract an object graph from the SURF syntax, a SURF parser that is also an URF processor can produce additional knowledge and make semantic inferences.
Tag IRIs
Every SURF handle is in fact represented by a unique tag IRI. Related handles belong to a namespace, which is also identified by an IRI. An URF processor will automatically map SURF handles to URF tag IRIs:
- If a SURF handle has no namespace prefix, it is placed in the ad-hoc namespace identified by https://urf.name/. Thus the SURF handle foo is identified in URF by the tag IRI https://urf.name/foo.
- If a SURF handle has one or more namespace prefixes, a namespace IRI is formed relative to https://urf.name/ using those prefixes, with prefix delimiters replaced by the slash / character. Thus the SURF handle example-foo is identified in URF by the tag IRI https://urf.name/example/foo.
- If a SURF object has an ID, its encoded ID is added as the fragment of the type tag IRI. Thus the object |"bar"|*example-foo is identified in URF by the tag IRI https://urf.name/example/foo#bar, of the type identified by the tag IRI https://urf.name/example/foo.
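The three mapping rules above can be sketched as a single function. This is an illustration rather than the behavior of an actual URF processor: the function name `handle_to_tag_iri` is hypothetical, and percent-encoding the ID is an assumption, since the text says only that the ID is "encoded".

```python
from urllib.parse import quote

AD_HOC_NAMESPACE = "https://urf.name/"


def handle_to_tag_iri(handle, object_id=None):
    """Map a SURF handle, and optionally an object ID, to a tag IRI."""
    # Prefix delimiters become path separators under the ad-hoc namespace.
    iri = AD_HOC_NAMESPACE + handle.replace("-", "/")
    if object_id is not None:
        # Assumption: the ID is percent-encoded before use as the fragment.
        iri += "#" + quote(object_id, safe="")
    return iri


print(handle_to_tag_iri("foo"))                 # https://urf.name/foo
print(handle_to_tag_iri("example-foo"))         # https://urf.name/example/foo
print(handle_to_tag_iri("example-foo", "bar"))  # https://urf.name/example/foo#bar
```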
Resources
In URF everything that can be described is a resource. Even simple value types such as strings or integers are also resources, each conceptually identified by a unique tag IRI. For more information on forming tag IRIs for common value resources, see the URF specification.
Statements
SURF documents processed by an URF processor are equivalent to a set of logical propositions or statements. Similar to those in RDF, URF statements consist of a subject, a property, and a value resource. Additionally each URF resource may have an associated type. The example at the beginning of this lesson would be understood by an URF processor as representing the following statements:
Subject | Property | Value |
---|---|---|
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | authenticated (urf-Property ) | true (urf-Boolean ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | sort (urf-Property ) | 'd' (urf-Character ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | name (urf-Property ) | "Jane Doe" (urf-String ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | email (urf-Property ) | ^jane_doe@example.com (urf-EmailAddress ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | phone (urf-Property ) | +12015550123 (urf-TelephoneNumber ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | usernames (urf-Property ) | |accountList| (urf-Set ) |
|accountList| (urf-Set ) | urf-member+ (urf-Property ) | "jdoe" (urf-String ) |
|accountList| (urf-Set ) | urf-member+ (urf-Property ) | "janed" (urf-String ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | homePage (urf-Property ) | <http://www.example.com/jdoe/> (urf-Iri ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | salt (urf-Property ) | %Zm9vYmFy (urf-Binary ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | joined (urf-Property ) | @2016-01-23 (urf-LocalDate ) |
|<&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832>| (User ) | credits (urf-Property ) | 123 (urf-Integer ) |
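One way to picture the table above is as a list of three-part records. The sketch below flattens a single object description into one statement per property; the `Statement` type, the `describe` helper, and the use of plain Python values are all illustrative assumptions, not the URF data model itself.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    prop: str      # the property name
    value: object  # the value resource, shown here as a plain Python value


def describe(subject, properties):
    """Turn one object description into one statement per property."""
    return [Statement(subject, name, value) for name, value in properties.items()]


user = "&bb8e7dbe-f0b4-4d94-a1cf-46ed0e920832"
statements = describe(user, {"authenticated": True, "name": "Jane Doe", "credits": 123})
for statement in statements:
    print(statement)
```

Every statement shares the same subject, just as every row describing the user in the table repeats the same tag.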
Review
Summary
SURF | JSON |
---|---|
|
|
Gotchas
- JSON does not distinguish between general numbers and integers; parsing a JSON document with non-fractional numbers as SURF will gain semantics, even if unintended, based upon whether the JSON numbers were serialized with decimal points and/or exponents.
- While a SURF parser recognizes the JSON object syntax, it means a slightly different thing in SURF. JSON objects are interpreted as maps in SURF. If you want to define an object in SURF, use the SURF object syntax.
- SURF parsers recognize JSON null, but it is discarded and does not form part of the resulting SURF object graph.
- SURF aliases are syntax only; they do not become part of the parsed model.
- Don't confuse an IRI such as <http://www.example.com/> with a resource identified by that IRI using the tag |<http://www.example.com/>|.
In the Real World
- A SURF parser can process any JSON text, with the understanding that JSON objects will be stored in SURF maps, and that numbers without decimal points and exponents will be considered integers.
Think About It
- Is the object you are describing uniquely identified by some identifier, such as a UUID or an email address? Rather than simply adding a proprietary property name to your object, it would be more useful semantically to use a tag for universal references recognized across vocabularies.
Self Evaluation
- What are the three types of resource representations available in SURF?
- Do the three resource representations result in different “types” of resources in the underlying URF data model?
- What is the difference between SURF objects and JSON objects?
- When do you need a comma to separate properties in SURF object descriptions?
- Which SURF types cannot be represented in JSON?
- In what cases would you want to use the TURF format instead of SURF?
Task
TODO
See Also
- Java SE 8 Date and Time (Ben Evans and Richard Warburton - Oracle)
References
- Simple URF (SURF) Specification
- RFC 3987: Internationalized Resource Identifiers (IRIs)
- RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace
- RFC 4648: The Base16, Base32, and Base64 Data Encodings