new HTML spec, sample implementation

Dan Connolly (connolly@pixel.convex.com)
Wed, 06 Jan 93 19:23:43 CST


I just uploaded the following to info.cern.ch:/pub/incoming
libHTML-930106.tar.Z
html_spec-930106.tar.Z

WHERE DO WE GO FROM HERE?

* registering HTML with the IANA

The spec is a hypertext. We need a plain text document
for the IANA. This is complicated by the fact that
much of the spec is "by example," that is, tolerated.html
demonstrates the tolerated techniques much better than
it explains them.

But I think the files HTML.html, Text.html, and html.dtd
make a workable spec. html.dtd has all the information,
HTML.html motivates it, and Text.html gives enough background
to read it.

* bringing implementations into compliance

LineMode -- Tim: I'd like to use SGML_read to do the lexical
stuff in the linemode browser. I haven't thought much about
EBCDIC support, but it shouldn't be too difficult. I think
SGML_read will fit neatly between HTParseFormat() and
HTGetCharacter().

NeXT browser -- Tim: I'd like to see the stuff on info.cern.ch
use the HEAD/BODY elements and &#60; instead of &lt;. If you
use the NeXT browser to maintain this stuff, or if anybody else
uses the NeXT browser, I'd like to see it brought up to date.

html-mode.el -- Marc: I have kept my copy of html.el up
to date as I have edited the spec. We should sync up.

MidasWWW -- Tony: I have kept my copy of midaswww up
to date as I have worked on the spec. We should sync up too.

Gateways -- Again, I request that anybody who provides
information to the web keep their server up to date
with this spec. It's the only way to motivate updates
to clients!

Now about those last few changes to the HTML spec...

INLINE ELEMENTS

I added <em>, <samp>, <code>, and several other elements,
inspired by Texinfo. We need these to support conventional
technical documentation. The list is not exhaustive, but
I think it's pretty good.

NUMERIC CHARACTER REFERENCES

I have learned a few things about SGML and made a few
decisions biased toward simplicity. As a result, I think
the spec is a little smaller and the sample implementation
is a little cleaner.

Most notably, I have introduced numeric character references
to the HTML spec. These were in SGML all along, but I didn't
understand them fully.

This raises the issue of character sets. The character set
in html.dtd is ISO646, i.e. ASCII. Everybody using html.dtd
agrees on the correspondence between the numerals 0-127
and the ASCII characters they represent. So to represent
a '<' character, we'll write "&#60;". This obsoletes
the lt, gt, and amp entities.
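
To make the writing side concrete, here is a rough sketch of how
a document generator might emit numeric character references for
the markup characters. It is in Python rather than C, and
escape_markup() is a hypothetical helper for illustration only;
it is not part of libHTML.

```python
# Hypothetical helper, not part of libHTML: replace the characters that
# would otherwise be read as markup with numeric character references.
# The numbers are the ASCII (ISO 646) code positions.
MARKUP_CHARS = "<>&"


def escape_markup(text):
    """Return text with '<', '>' and '&' written as &#NN; references."""
    out = []
    for ch in text:
        if ch in MARKUP_CHARS:
            out.append("&#%d;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)
```

With this, escape_markup("a<b") gives "a&#60;b", which is the form
the spec now expects in place of "a&lt;b".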

On the other hand, I did not include an 8-bit character set.
So the meaning of "&#255;" is not defined. The HTML DTD references
"ISO 8879:1986//ENTITIES Added Latin 1//EN" instead of
"ISO Registration Number 109//CHARSET ECMA-94
Right Part of Latin Alphabet Nr. 3//ESC 2/13 4/3". So we'll write
"&yuml;" for the character that corresponds to position 255
in the ISO 8859-1 encoding.

In the sample implementation, numeric character references
are invisible to the application: the translation from
"&#60;" to '<' happens inside the SGML_read routine. On
the other hand, entity references like "&yuml;" are
handed back to the application for processing.
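
The division of labor might look roughly like the sketch below.
This is a Python illustration, not the actual SGML_read code:
read_text(), the on_entity callback, and latin1_entities() are
names made up for the example. It assumes every reference ends
in ";" (real SGML is more lenient), and it leaves numeric
references above 127 untouched, since the spec does not define
them.

```python
import re

# A reference is "&#digits;" (numeric) or "&name;" (entity).
# Assumption for this sketch: the closing ";" is always present.
_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.-]*);")


def read_text(data, on_entity):
    """Decode numeric references in place; pass entity names to on_entity."""
    out = []
    pos = 0
    for m in _REF.finditer(data):
        out.append(data[pos:m.start()])
        ref = m.group(1)
        if ref.startswith("#"):
            code = int(ref[1:])
            if code < 128:
                # ASCII range: the reader translates it silently.
                out.append(chr(code))
            else:
                # Outside ISO 646: meaning undefined, keep the raw text.
                out.append(m.group(0))
        else:
            # Entity reference: the application decides what it means.
            out.append(on_entity(ref))
        pos = m.end()
    out.append(data[pos:])
    return "".join(out)


def latin1_entities(name):
    """Example application callback that knows a single Latin 1 entity."""
    table = {"yuml": "\u00ff"}  # y with diaeresis, position 255 in ISO 8859-1
    return table.get(name, "&" + name + ";")
```

Calling read_text("a &#60; b &yuml;", latin1_entities) turns the
"&#60;" into '<' without the callback ever seeing it, while "yuml"
is handed to latin1_entities() for the application to resolve.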

Dan