An HTML specification and Implementors' Guide

Dan Connolly (connolly@pixel.convex.com)
Mon, 30 Nov 92 07:27:01 CST


I just uploaded

html_spec-0.3.tar.Z

to info.cern.ch in pub/incoming.

It's hypertext including

* MarkUp.html -- the root node
* Text.html -- an introduction to SGML syntax
* html.dtd -- the spec expressed in SGML
* several example files that form a validation suite
* libHTML.tar -- some code that implements the low-level
SGML reading state machine (with a test driver)

Tim: please link this into the web somehow.

Implementors: please grab the whole thing and validate
your implementation against it.

Tony: I've got some patches for the MidasWWW browser.
I'm not quite done cleaning them up.

Linemode fans: I haven't started messing with linemode
yet.

Issues Closed Pending Review:

Long Names

I included an SGML declaration that increases NAMELEN to 34,
and LITLEN to 1024. I got these numbers from the DocBook DTD.

SGML IDs for Anchor Names

The NAME attribute of the A element is an ID. It must start
with a letter, and it must be unique among all the IDs in
the document. [Note that there is no way to validate the #anchor
part of the HREF attribute. I'm working on that...]
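
To make the ID rule concrete, here is a rough sketch of the check in
Python. It is not part of the uploaded package: the function name is
made up for illustration, and it assumes the reference concrete syntax
(a letter, then letters, digits, '.' or '-'), case-folded names, and
the NAMELEN of 34 mentioned above.

```python
import re

# SGML name under the reference concrete syntax (assumed): a letter,
# followed by letters, digits, '.' or '-'.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*\Z")


def check_anchor_names(names, namelen=34):
    """Return a list of error strings for a document's A NAME values.

    check_anchor_names is a hypothetical helper, not a libHTML function.
    """
    errors = []
    seen = set()
    for name in names:
        if not _NAME.match(name):
            errors.append("not an SGML name: %r" % name)
        elif len(name) > namelen:
            errors.append("longer than NAMELEN: %r" % name)
        key = name.upper()  # assumption: names are folded to one case
        if key in seen:
            errors.append("duplicate ID: %r" % name)
        seen.add(key)
    return errors
```

A document whose anchors are named "intro", "2nd" and "Intro" would get
two complaints: one for the leading digit and one for the duplicate.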

Multimedia Links

I included a content-type attribute for links so that you can tell the
browser what type of data you're pointing to, and it can decide what to
do with it (at a minimum, use this attribute and pass the data to
metamail). I added a content-description attribute in case you want the
reader to be able to get some information about the data without
transferring it, but now I'm not sure it's a good idea. The description
should go in the content of the A element.
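
As an illustration of how a browser might act on the content-type
attribute, here is a small Python sketch. The handler names, the set of
types rendered inline, and the default of text/html when the attribute
is missing are all assumptions made for the example; only the fallback
to metamail comes from the text above.

```python
# Assumption: the types the browser displays itself.
INLINE_TYPES = {"text/html", "text/plain"}


def choose_handler(content_type):
    """Pick what to do with a link target, given its content-type attribute.

    Returns "inline" if the browser should display the data itself,
    or "metamail" if the data should be handed to metamail.
    """
    if content_type is None:
        content_type = "text/html"  # assumption: untyped links are HTML
    # Drop any parameters (";charset=...") and compare case-insensitively.
    base = content_type.split(";", 1)[0].strip().lower()
    if base in INLINE_TYPES:
        return "inline"
    return "metamail"
```

With this, a link marked image/gif goes to metamail, while one marked
"Text/HTML; charset=iso-8859-1" is displayed by the browser.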

Formatted Text with Anchors

I took the semantics of the PRE tag, added the WIDTH attribute, and
called it TYPEWRITER (inspired by the nroff man page). It's parsed like
most other elements, but displayed like XMP or LISTING or PLAINTEXT.

Newline handling isn't a parsing issue -- it's a display issue. I think
it will be more straightforward to define newlines in TYPEWRITER
content to be significant. That way, once the data is parsed, XMP
and TYPEWRITER work just the same. Lines may get real long. That's
life. If you want to mail it, use MIME or uuencode or something.

XMP and LISTING elements are CDATA: they have no markup in their
content. There's no way to put </TITLE> inside an XMP element.

PLAINTEXT is an empty element that signals the end of a text/html
entity and begins a text/plain entity.
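
The XMP restriction can be sketched in Python, assuming the usual SGML
rule that CDATA content ends at the first "</" followed by a letter.
The function name is hypothetical; the code in libHTML may well do
this differently.

```python
import re

# End-tag open: "</" immediately followed by a letter.
_ETAGO = re.compile(r"</(?=[A-Za-z])")


def cdata_content(text, start=0):
    """Return (content, end) for CDATA content beginning at `start`.

    Everything up to the first end-tag open is taken literally: "<"
    and "&" mean nothing special inside it. `end` is the index where
    the terminating "</" begins, or len(text) if there is none.
    """
    match = _ETAGO.search(text, start)
    end = match.start() if match else len(text)
    return text[start:end], end
```

Given "a <b> & c</TITLE> more</XMP>", the content stops before
"</TITLE>", which is why that string can't appear inside XMP.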

Ordered Lists

I included them in the DTD. Any objections?

ISO Latin 1 Characters:

I included a reference to "ISO 8879:1986//ENTITIES Added Latin 1//EN"
in the HTML DTD. This defines entities for all ISO Latin 1 characters.
Clients will need a table of the names and local translations.
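
A client-side table might look something like the Python sketch below.
Only a handful of the Added Latin 1 names are shown, the translations
are given as Unicode characters purely for illustration, and leaving
unknown names untouched is an assumption rather than anything the DTD
requires.

```python
import re

# A few entity names from the Added Latin 1 set, with one possible
# local translation each. A real client would carry the whole set.
LATIN1 = {
    "eacute": "\u00e9",
    "egrave": "\u00e8",
    "uuml": "\u00fc",
    "ccedil": "\u00e7",
    "ntilde": "\u00f1",
    "szlig": "\u00df",
    "AElig": "\u00c6",
}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, table=LATIN1):
    """Replace &name; references found in `table`; leave the rest alone."""
    return _ENTITY.sub(lambda m: table.get(m.group(1), m.group(0)), text)
```

So "caf&eacute;" comes out with an accented e, and a reference to a
name missing from the table passes through unchanged.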

Open Issues:

Highlighting: Whose tags should we use? LaTeX seems to be an adequate
markup system for lots of folks. Its tags are
em | it | bf | sf | sl | tt

The DocBook folks use only semantic tags: they don't have bold or italic
tags. The MIME richtext stuff has only typographic tags: no <emphasis>
or <booktitle> or any such thing.

Dan