Re: Semantics of <Hn>

Klaus Harbo
Thu, 12 Aug 1993 10:56:04 +0200

Tony Sanders wrote, commenting on Nathan's posting:

> > As it seems to me, <H1>text</H1> is being thought of by some people as
> > defining a region of text (the text until the next <H1>) whereas it
> > ought (I think) to be simply interpreted as placing a heading. As
> > such, it ought to be able to occur anywhere a <P> can, as well as
> > after <DT><DD> pairs in description lists.
> >
> > Whew. Any comments?
> No, No, No.
> That about covers it I think
> --sanders

Fortunately this is not a matter of opinion, since SGML DTDs define the
structure of document instance parse trees unambiguously (provided, of
course, that the DTD was designed properly).

Excerpting from the current HTML DTD:

<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >
<!ENTITY % bodyelement
"P | A | %heading |
| %literal">
<!ENTITY % inline "EM | TT | STRONG | B | I | U |
<!ENTITY % text "#PCDATA | IMG | %inline;">
<!ELEMENT BODY - - (%bodyelement|%text;)*>
<!ELEMENT ( %heading ) - - (%text;|A)+>

H1 elements can only contain one or more A elements or the elements
allowed by %text;. If anyone thinks of it as being any more than that,
they are in error.
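
To make the content model concrete, here is a minimal sketch in Python of
a check against (%text;|A)+. It is only an illustration, not part of any
parser: the function name heading_content_ok and the list-of-names input
format are my own assumptions, and since the %inline entity is cut short
in the excerpt above, the set below holds only the names visible there.

```python
# Names taken from the DTD excerpt. The %inline; list is truncated in the
# excerpt, so this set holds only the names that are visible there.
INLINE = {"EM", "TT", "STRONG", "B", "I", "U"}
TEXT = {"#PCDATA", "IMG"} | INLINE      # %text;
HEADING_CONTENT = TEXT | {"A"}          # (%text;|A)+


def heading_content_ok(children):
    """Return True if a heading's children satisfy (%text;|A)+.

    `children` is a list of element names, with "#PCDATA" standing for
    character data (a hypothetical input format). The "+" in the content
    model means at least one item is required.
    """
    return len(children) >= 1 and all(c in HEADING_CONTENT for c in children)
```

Under these assumptions, a heading holding text and an anchor passes,
while a heading holding a P element or another heading does not; the
heading does not "contain" the text that follows it up to the next one.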

Reading this list I sometimes worry that some people think that most
of the issues are matters of opinion when really they are not.

WWW allegedly uses SGML. However, when I read the HTML+ spec [HTML+
(Hypertext Markup Language) by Dave Raggett, my version is dated 12
July], I sometimes wonder if it is really SGML you are talking about,
or just some lookalike clone. I quote:

"Please ensure that browsers can tolerate bad markup. In
practice, this is straightforward to achieve, provided that a
naive top-down SGML parser is avoided. A forgiving parser
should be able to cope with tags in unexpected positions,
e.g. the <A> tag bracketing a header [footnote omitted].
Unknown tags should simply be ignored."

HTML+ spec, p. 4

SGML was specifically designed to make SGML document instances LL(1)
parsable. This also makes them top-down parsable. Of course I can
interpret the above statement as "naive top-down parsers exist", but I
rather read it as "top-down parsers are naive", which is nonsense
when you're dealing with LL(1) grammars.
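
As a rough illustration of what top-down parsing with one token of
lookahead looks like, here is a small Python sketch. The grammar is a
deliberately tiny, hypothetical fragment (a body holding text and
headings, headings holding only text), not the HTML DTD, and the token
tuples, ParseError, parse_body and parse_heading are all names invented
for this example.

```python
HEADINGS = {"H1", "H2", "H3", "H4", "H5", "H6"}


class ParseError(Exception):
    """Raised when the input does not match the (toy) grammar."""


def parse_body(tokens):
    """Parse a token list top-down, deciding each step from one token.

    Tokens are tuples: ("open", name), ("close", name) or ("text", data).
    """
    pos = 0
    tree = []
    while pos < len(tokens):
        kind, value = tokens[pos]          # the single lookahead token
        if kind == "text":
            tree.append(value)
            pos += 1
        elif kind == "open" and value in HEADINGS:
            node, pos = parse_heading(tokens, pos)
            tree.append(node)
        else:
            # Unknown or misplaced tags are rejected, not silently ignored.
            raise ParseError(f"unexpected {kind} {value!r} in BODY")
    return tree


def parse_heading(tokens, pos):
    """Parse one heading starting at tokens[pos]; return (node, new_pos)."""
    name = tokens[pos][1]
    pos += 1
    content = []
    while pos < len(tokens) and tokens[pos][0] == "text":
        content.append(tokens[pos][1])
        pos += 1
    if not content:
        raise ParseError(f"{name} needs at least one content item")
    if pos >= len(tokens) or tokens[pos] != ("close", name):
        raise ParseError(f"expected </{name}>")
    return (name, content), pos + 1
```

Each decision is made from the current token alone, and anything the
grammar does not allow is reported as an error at the point where it
occurs.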

However, the real problem is that the authors are telling us that we
are not really dealing with documents that are parsable according to a
grammar that is known in advance. At any time we can encounter
"unknown tags" or "bad markup".

One of the great things about real, hard science (math, physics, ...)
is that you can use other people's work (proofs, results) in your
own. Likewise, one of the major benefits of using SGML is that you can
use publicly available parsers (e.g. ARCSGML, SGMLS) to parse
document instances instead of writing a parser yourself. In computing,
people reinvent the wheel all the time, which is a waste of time. In
this particular case the HTML+ spec forces people to reinvent the
wheel by writing parsers over and over.

Of course I realize that the quoted statement is intended to make the
information providers' lives easier, but the real solution to that
problem is to give providers tools that will let them write and translate
their stuff to a proper, unambiguous format; the solution is not
letting them "do what they want" with regard to tagging.

Tolerating bad markup will create endless problems in the long
run. The way HTML+ is formulated, there is no way to tell if a
document is HTML+ compliant or not.

Was this issue discussed at W^5? Comments, anyone?

Regards from a WWW lurker and SGML heretic,


|  Klaus Harbo                   | e-mail: |
|  Euromath Center   (EmC)       | phone (direct):           +45 3532 0713 |  
|  Universitetsparken 5          | phone (sw.board):         +45 3532 1818 | 
|  DK-2100 Copenhagen            | fax:                      +45 3532 0719 |
	The opinions stated here are not necessarily those of
	     the Euromath Center or of the Euromath Project.