Updated URI test suite; resolving some issues...

Daniel W. Connolly (connolly@hal.com)
Wed, 16 Mar 1994 22:50:18 --100

I've updated my URI test suite
to address _lots_ more issues. While I was at it, I of course had to
tweak the grammar because of things I hadn't thought of.


I also settled on a workable final set of data characters. I started
with the POSIX portable filename character
set (letters, digits, hyphen, underscore, and period). Then I looked
at the MIME recommendations about characters that make it through
mailers without harm. But in the end, I settled on the set from the
isAcceptable table from HTParse.c in the libwww distribution out of

So the data characters are letters, numbers, period, hyphen,
underscore, at-sign and asterisk. In regexp-speak, that's


The first result of this is that the user@host in
is one word to the parser. So it's no longer part of the URI syntax --
it's specific to the FTP scheme. This is handy in that it makes the
grammar LR(1) again! There is a conflict when using user:passwd@host,
though. The ':' is special and can't be part of a word unless it's
escaped. So the full ftp syntax will have to change to:
or something else where the whole user/passwd/host triple is one word.
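
A minimal sketch of that character set in Python, assuming the set is
exactly the one listed above (letters, digits, period, hyphen,
underscore, at-sign, asterisk). The names WORD, is_word, and
split_words are illustrative only and do not come from the test suite
or from HTParse.c. It shows why user@host is a single word while
user:passwd@host is not:

```python
import re

# Data characters as listed in the text: letters, digits, period,
# hyphen, underscore, at-sign, and asterisk. (Assumed to be the
# complete set; the names here are hypothetical.)
WORD = re.compile(r"[A-Za-z0-9.\-_@*]+")


def is_word(text):
    """Return True if text consists solely of data characters."""
    return WORD.fullmatch(text) is not None


def split_words(text):
    """Return the maximal runs of data characters found in text."""
    return WORD.findall(text)
```

With this set, is_word("user@host") holds, but the unescaped ':' in
"user:passwd@host" splits it into two words.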


The other result of picking that char set is that all the other
characters ("!@#$%^&*()=+~`':...") are either markup or reserved.

This caused a conflict with WAIS URLs. So I extended the grammar to
include ';' and '=' as tokens, and added keyword=value syntax. So
the syntax for WAIS files is:
and the parser extracts the keyword/value pairs.
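
Roughly, extracting the pairs could look like the Python sketch below.
It assumes a path segment of the form name;keyword=value;..., which is
only a guess at the general shape and not the actual WAIS syntax; the
function name parse_segment and the sample input are hypothetical.

```python
def parse_segment(segment):
    """Split 'name;key=value;...' into (name, {key: value}).

    Assumes ';' separates parameters and '=' separates a keyword from
    its value; this layout is an illustration, not the grammar itself.
    """
    name, *params = segment.split(";")
    pairs = {}
    for param in params:
        key, sep, value = param.partition("=")
        if not sep:
            raise ValueError(f"expected keyword=value, got {param!r}")
        pairs[key] = value
    return name, pairs
```

For a made-up input such as "doc;type=TEXT;size=99999", this yields the
name "doc" plus the pairs type=TEXT and size=99999.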

The keyword=value syntax is allowed in the path and in the search
string. So the syntax includes things like:


I have been thinking about near-term ways to deploy URNs. Even if
there is no generalized way to resolve a URN to a URL, they are
useful. For example, I have a whole bunch of cached documents from the
web in my local filesystem. But the connection between them and the
place they came from is lost. So when I'm browsing some document that
references the MIME RFC, for example, my browser has no way of taking
advantage of the fact that I've already got a copy of it locally. And
the problem scales as documents are copied, mirrored, cached, etc.

On the other hand, if we had an rfc: URN scheme registered, I could
perhaps configure my browser (or my proxy server) to map
rfc:* => local-file:/u/connolly/web/rfc/* (try this first)
=> ftp://ds.internic.net:/rfc/* (try this next...)
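
As a sketch of what such a configuration might do, here is the mapping
above as an ordered rewrite table in Python. The RULES table and the
candidates function are hypothetical; no browser or proxy is known to
read this format. It only builds the list of URLs to try, in order, and
leaves the fetching to the caller.

```python
# Ordered rewrite rules: URN prefix -> URL prefixes to try, first to last.
# The targets are copied from the mapping above; the table format itself
# is an assumption made for illustration.
RULES = [
    ("rfc:", ["local-file:/u/connolly/web/rfc/",
              "ftp://ds.internic.net:/rfc/"]),
]


def candidates(urn, rules=RULES):
    """Return candidate URLs for urn, in the order they should be tried."""
    for prefix, targets in rules:
        if urn.startswith(prefix):
            rest = urn[len(prefix):]
            return [target + rest for target in targets]
    return []  # no rule applies; the URN cannot be resolved locally
```

A message-id: or mailing-list: scheme would just be further entries in
the same table.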

The same is true of mailing lists. When I'm browsing the www-talk
archive, I actually have local copies of many of the messages. We
could register a message-id:<id> scheme or even mailing-list:mbox/<id>
scheme. Then I could map
=> local-file:/u/connolly/Mail/by-id/www-talk/*
=> local-file:/u/connolly/News/by-id/comp.text.sgml/*
=> wais://ifi.no/comp.text.sgml/TEXT/99999/*

I extended the grammar to include relative URIs, and I invented a way
to merge URNs into the URL namespace while still being able to tell
them apart. A URL always looks like:
whereas a URN always looks like:
(i.e. no slash)
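
A rough Python sketch of that test, assuming the rule is simply "a
slash right after the scheme's colon means URL, anything else means
URN", and that a string with no scheme is a relative URI. The function
name classify and the three labels are my own illustration, not part of
the grammar.

```python
def classify(uri):
    """Label uri as 'url', 'urn', or 'relative'.

    Assumption: a URL has '/' immediately after 'scheme:', a URN does
    not, and anything without a scheme is treated as relative.
    """
    scheme, sep, rest = uri.partition(":")
    if not sep or not scheme:
        return "relative"
    return "url" if rest.startswith("/") else "urn"
```

So rfc:rfc822.txt comes out as a URN, while
ftp://ds.internic.net/rfc/rfc822.txt and local-file:/u/connolly/...
come out as URLs.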

So we can begin to deploy things like:
by, for example, using the www_proxy mechanism in Mosaic.

Why is it necessary to distinguish URNs from URLs? To me, the
distinction between URNs and URLs is that URNs identify immutable
objects, and URLs identify mutable objects. Once you've resolved a
URN, you can keep that copy forever and use it to satisfy other
queries for that URN. As to the issues of versioning, translation,
etc., I'd say that a URN may identify a set of documents, and the
versions, translations, etc. are elements of the set.

For example, the URNs
are elements of, say

The last URN above can't be directly resolved.

In many ways, the URN <rfc:rfc822.txt> is the same as the URL
<ftp://ds.internic.net/rfc/rfc822.txt>. But a WWW client has no
way of knowing that the ftp file is guaranteed not to change.

Hmmm... this isn't all coming together like I had hoped. The goal is
to deploy the more sophisticated "URCs" or IAFA-templates or whatever
in a scalable, distributed fashion. In the short term, I'd like to be
able to compose documents with references like:

<REFERENCE linkend="x1">RFC 822: Format for Internet Mail Messages
<urnloc ID="x1" locsrc="loc1"
DATE="19910434094433" EXPIRATION="19990101000000">rfc:rfc822.txt</urnloc>
<url ID="loc1" backup="loc2">local-file://ulua.hal.com/u/connolly/rfc/822.txt
<url ID="loc2" backup="loc3">wais://host/rfcs/.../822.txt
<url ID="loc3" backup="loc4">ftp://ds.internic.net/rfc/rfc822.txt</url>
<bibloc ID="loc4">
0822 S D. Crocker, "Standard for the format of ARPA Internet text
messages", 08/13/1982. (Pages=47) (Format=.txt) (Obsoletes
RFC0733) (STD 11) (Updated by RFC1327, RFC0987)

Anyway... this citation stuff is still muddling around at this point.
But I think I've got most of the URL issues hammered out, while
leaving room for URNs and allowing this stuff to be used in URCs.