Re: <draft-ietf-iiir-html-01.txt, .ps> to be deleted.

Peter Flynn (pflynn@curia.ucc.ie)
Tue, 15 Feb 1994 15:06:37 --100


> My major concern is that it implies a tremendous increase in the
> complexity of the HTML parser. With my original HTML specification, an
> HTML parser only parsed the instance part of the SGML document. With
> this HTML+ specification, WWW clients will have to parse the prologue
> as well.

If the prolog is valid SGML, this should not be a problem to code (just
borrow it) but what kind of slow-down will this mean for users? My guess
is that it won't be used heavily by most authors until SGML itself
becomes more widespread and they start using compliant editors.

Q. By "parse" do we mean "check for conformance and reject if in error"
as well as "extract anything which needs acting upon"? The whole area
of what a browser should do when it encounters invalid HTML+ needs some
more thought.

Rob: at the Nov meeting, both Lou M and Chris W were of the opinion that
whatever the formalities of SGML, browsers ought just to make the best
effort they can and not reject a document (ever?). Bill P felt they
should be more critical. At the other extreme, browsers _could_ actually
refuse to display an invalid instance, and mail the author (or webmaster)
with a parser log (probably not a useful thing to do, though). Somewhere
between these lies a workable solution: what's the Mosaic team's take?

> You have, however, simplified matters by not putting any parameter
> entities in content models. This means that WWW clients won't have to
> deal with individual documents introducing new element types "on the
> fly."

This is what the RENDER tag is for, surely?

> But you've introduced OMITTAG, <!ENTITY> parsing, and lots of other
> stuff. If we plan to include a full blown SGML parser in every WWW
> client, why not use all the syntactic sugar like short references and
> cool stuff like that while we're at it?

Why?

> Let's get formal why don't we: I do not mean that we should be able to
> take any RTF file and convert it to HTMLPLUS, or MIF for that matter.
> But I think it's crucial that there exist invertible mappings
>
> h : HTML -> RTF
> and
> g : HTML -> MIF
> and
> f : HTML -> TeXinfo
>
> so that I can take a given HTML document, convert it to RTF, and
> convert it back and get exactly what I started with (the same ESIS,
> that is... perhaps SGML comments and a few meaningless RE's would get
> lost).
>
> >For instance, document text is forced to appear within paragraph elements
> >which act as containers. Documents are broken into divisions using the
> >new DIVn elements which give substance to the notion that headers start
> >sections which continue up to the next peer or larger header.

It's worth pointing out what several users have already said to me when
I suggested making <P> a container (let me quote from one of them):

>>> By the way, you get weird results when you try an SGML-based
>>> filter (e.g., Omnimark) on an HTML file in which P is empty.
>>> There's all that uncontained PCDATA sloshing around.

Omnimark is arguably the most robust *-->SGML-->* converter around. If
we want a completely (well, 99%) invertible conversion (which I think
we do), then we should look to sticking with concepts that are known
to work.
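
One way to make "get exactly what I started with" checkable is to compare
simplified event streams before and after a round trip. The sketch below is
only an illustration: it uses Python's standard html.parser as a stand-in
for a real SGML parser, and the event list it builds is an approximation of
ESIS, not ESIS itself. The names EventCollector, events and round_trips are
invented here. Comments and line breaks are deliberately ignored, as in the
quoted text above.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Collect a simplified, ESIS-like event list (an approximation only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Attribute order is not significant, so sort by attribute name.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        words = data.split()
        if not words:
            return  # whitespace-only data (e.g. bare line breaks) is dropped
        # Merge with a directly preceding data event so that an ignored
        # comment in the middle of text does not change the event list.
        if self.events and self.events[-1][0] == "data":
            words = self.events.pop()[1].split() + words
        self.events.append(("data", " ".join(words)))

    # handle_comment is left as the inherited no-op, so comments are lost.


def events(document):
    """Return the simplified event list for one document string."""
    collector = EventCollector()
    collector.feed(document)
    collector.close()
    return collector.events


def round_trips(document, forward, backward):
    """True if backward(forward(document)) yields the same event list."""
    return events(backward(forward(document))) == events(document)
```

Here forward and backward stand for whatever converter pair is under test
(HTML to RTF and back, say); any pair of functions from string to string
can be passed in.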

> If we're going to burden WWW clients with all this rich structure and
> OMITTAG parsing, why don't we go with something like DocBook, which
> has a proven ability to capture the structure of existing technical
> documents, instead of trying to roll our own?

Because a large amount of what people may want to put up on web servers
is not necessarily technical documentation. This is why we went to the
trouble of bringing together the TEI and some browser writers. Using
DocBook would be _way_ too limiting.

> >Similarly, missing <P> tags can be inferred when the browser sees
> >something belonging to %text. This neatly deals with the common case
> >where some authors think of <P> as a paragraph separator or something
> >you should put at the end of each paragraph (this view is promulgated
> >by Mosaic documentation).

I think this is a dangerous path to tread. Yes, what you suggest could be
done, but I think it's going too far down the path of trying to let
plaintext documents masquerade as HTML.

> Is this form of inference consistent with the SGML standard? Or is
> this a non-standard extension to support legacy HTML documents?

Goldfarb discusses a similar concept in The Book. I haven't got it here
but I can look it up. It involves redefining record-end and record-start
so that they can act as GIs, I think.

> >My HTML+ browser works this way, using a top-down parser which permits
> >most elements to have omissible start and end tags, using the context
> >to identify missing tags. Each element is associated with a procedure.
> >It's easy this way to recover the structure of badly authored documents
> >e.g. with missing <DL> start tags.

It also makes it much easier for the user to foul things up if they don't
appreciate what they are doing. It's "bad" enough for non-SGML people to
cope with remembering to insert tags to do things, but if they have to
remember what the effect is of _not_ inserting tags, just when we were
getting to the stage of getting them used to SGML....well...
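
To show concretely what inferring <P> tags from context involves, here is a
minimal sketch. It is not the parser described in the quoted text; it is a
toy that handles only <P>, headings and text-level content, and the names
tag_name and infer_paragraphs are invented for the illustration. It treats
any <P> or </P> as "the current paragraph ends here" and opens a paragraph
container lazily when text turns up outside a heading, so <P> used as a
separator, a terminator or a container all come out the same. Comments and
declarations are not handled.

```python
import re

TAG = re.compile(r"(<[^>]+>)")
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def tag_name(token):
    """Return (name, is_end_tag) for a tag token such as '<P>' or '</h1>'."""
    inner = token[1:-1].strip()
    is_end = inner.startswith("/")
    parts = inner.lstrip("/").split()
    name = parts[0].lower() if parts else ""
    return name, is_end


def infer_paragraphs(text):
    """Rewrite legacy markup so that every paragraph is a <P>...</P> pair."""
    out = []
    in_paragraph = False
    in_heading = False

    def close_paragraph():
        nonlocal in_paragraph
        if in_paragraph:
            out.append("</P>")
            in_paragraph = False

    for token in TAG.split(text):
        if not token:
            continue
        if TAG.fullmatch(token):
            name, is_end = tag_name(token)
            if name == "p":
                # Start or end tag: either way the current paragraph is over.
                close_paragraph()
                continue
            if name in HEADINGS:
                close_paragraph()
                in_heading = not is_end
                out.append(token)
                continue
        elif not token.strip():
            # Keep whitespace inside a paragraph or heading, drop it elsewhere.
            if in_paragraph or in_heading:
                out.append(token)
            continue
        # Text or a text-level tag: make sure a paragraph is open around it.
        if not in_heading and not in_paragraph:
            out.append("<P>")
            in_paragraph = True
        out.append(token)

    close_paragraph()
    return "".join(out)
```

Even in this toy the author has to know that a heading silently closes the
open paragraph, which is the kind of rule about _not_ inserting tags that
the paragraph above is worried about.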

> >In future, we expect authors will use specialized wysiwyg editors for HTML+
> >or automated document format conversion tools and hence produce documents
> >which naturally conform to the DTD.

Don't even have to be WYSIWYG...but they will need to be conformant.

> Hmmm... as long as there are no un-broken documents that would be
> misinterpreted by these heuristics, I think it's a great idea. (Again,
> though, I'd like to see a formal argument that this is the case.)

Which is why we need to formalise the frozen HTML right now, including
making <p> a container, so that we have (a) a benchmark for existing
browsers; (b) a standard for authors to use that works properly and (c)
something we can base future HTML+ engines on for the bits where they have
to deal with legacy docs.

> >Actually, once you state that HTML is an SGML format, then formally each
> >document can extend the DTD.

I don't think that's meant to be the case. You can allow the inclusion of
entity definitions but I don't think you can let people rewrite the DTD
on the fly.

> > HTML+ merely exploits this to show authors
> >how to declare which extension they wish to use: forms, tables, figures etc.
> >I owe a debt here to Lou Burnard and the TEI DTDs which showed me how and
> >why to use this approach.

The TEI DTD is somewhat bigger and more complex than HTML+. If we were to
go for something else, this would be the one, not DocBook.

> >I have investigated HyTime compliance with Yuri Rubinsky and Elliot Kimber
> >(Dr Macro), and know how to add this in. At the moment though, most people
> >in the WWW community see little value in switching to a model which forces
> >you to declare hypertext links at the start of the document.

Huh. Authors of large documents see it.

> >This will change if and when HyTime gets widely adopted. On the other hand, I
> >feel it is essential for HTML+ to conform to SGML. Without this, publishers
> >and businesses will tend to see WWW as a passing experiment that needs to
> >be replaced by something on a more professional/commercial footing. This is
> >why I am working so hard to extend HTML into something that meets publishers'
> >and users' expectations for document delivery. NCSA have done their bit - now
> >it's my turn to roll up my sleeves and get down to serious programming :-)
>
> There's a lot of good stuff in this latest DTD. I think we need a more
> sophisticated, fault-tolerant linking element, and a few other things,
> but you might be on the right track.

I think Dave is. I'm a tad worried about the emphasis on publishing: still
way too many publishers think of SGML as a wordprocessor, and spend $000s
on in-house DTDs which implement dozens of attributes for each tag, giving
font info and hard-coded positional information. I would be much happier
seeing HTML+ start to educate them.

I'm going to inject a plea here for one small tweak to HTML+ which I
mentioned in November. A NUM attribute for <P>, <LI>, <DT> and <DD> for
recording (not displaying or predicating) the original numbering sequence
in cases where the HTML has been generated by a converter. This would
make finding the original a whole lot easier.
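
As a rough picture of what a converter might emit, here is a small sketch.
NUM is only the attribute proposed above, not part of any DTD, and the
function names list_item and numbered_list are invented for the example.
The original number is carried as an opaque string so that forms such as
"4.2(a)" survive unchanged.

```python
from html import escape


def list_item(original_number, text):
    """One <LI> carrying the source document's own number in NUM."""
    return '<LI NUM="{}">{}'.format(
        escape(original_number, quote=True),  # safe inside a quoted attribute
        escape(text),                         # escape &, < and > in content
    )


def numbered_list(items):
    """Build an <OL> from (original_number, text) pairs."""
    lines = ["<OL>"]
    lines.extend(list_item(number, text) for number, text in items)
    lines.append("</OL>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(numbered_list([("4.2(a)", "Scope & purpose"), ("4.2(b)", "Terms")]))
```

A browser would be free to ignore NUM entirely; its only job is to let
someone get from the converted item back to the numbered original.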

> p.s. I'd like to start some sort of html-successor-design discussion
> forum. Is comp.infosystems.www, comp.text.sgml, or www-talk a suitable
> forum? Shall we create one?

I think it should carry the string `html' or `htmlplus' rather than `www'.
comp.text.sgml.html might do.

///Peter