Re: new HTML spec, sample implementation

Tim Berners-Lee (timbl@www3.cern.ch)
Tue, 12 Jan 93 09:39:27 +0100


> Date: Fri, 08 Jan 93 13:57:32 CST
> From: Dan Connolly <connolly@pixel.convex.com>
>

> This question seems to confuse two things: the ISOlat1 entity
> set, and the ISO Latin 1 character set. The first is a mapping
> of names to glyphs, and the second is a mapping from the numbers
> 128-255 to glyphs. I think they're in alphabetical order
> by name, but not in order by the ISO Latin 1 character set.

I think we should specify ISO Latin 1 as the base set. I think that
a lot of people in the Nordic countries use it routinely and they
will go crazy if they have to overload the curly brackets again
as they have to with mail.

Therefore, we should allow those people who have 8-bit capability to
just stick in 8-bit codes. Admittedly I thought the ISO world kept to
the codes 21-7E and A1-FE hex for G0 and G1 graphics sets, using the
others for control sets (C0 and C1). Maybe ISO Latin 1 has nothing
to do with ISO 8-bit extensions. Sorry I can't quote ISO numbers.
But whatever is common usage, let us have an 8-bit set.

(Anybody illuminate us on this? Anybody got the ISO Latin 1
character set listing by number?)
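
A minimal sketch of the code ranges recalled above, in Python. The
function name and the "unassigned" label are made up for illustration,
and the ranges are the ones quoted in this message, not checked against
the ISO text.

```python
def classify(code: int) -> str:
    """Classify an 8-bit code using the ranges quoted in the message."""
    if not 0 <= code <= 0xFF:
        raise ValueError("not an 8-bit code")
    if code <= 0x1F:
        return "C0"          # 7-bit control set
    if 0x21 <= code <= 0x7E:
        return "G0"          # 7-bit graphics set
    if 0x80 <= code <= 0x9F:
        return "C1"          # 8-bit control set
    if 0xA1 <= code <= 0xFE:
        return "G1"          # 8-bit graphics set
    # 0x20, 0x7F, 0xA0 and 0xFF fall outside the quoted ranges.
    # ISO 8859-1 itself does assign graphics to 0xA0 and 0xFF.
    return "unassigned"


for code in (0x0A, 0x41, 0x9B, 0xF6, 0xA0):
    print(hex(code), classify(code))
```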

Now for dyed-in-the-wool 7-bit hackers, is it fair to require them to
remember numbers, or would it be nicer to allow them to put in
codes using entity names? Some people would, I am sure, like the
latter, but it is NOT important because we are aiming for WYSIWYG
editors and so would regard human-readable character names as a
temporary thing anyway.
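
To illustrate the two options for 7-bit authors, here is a small
Python sketch that accepts either a number or an entity name and
produces the same 8-bit code. The table holds only a handful of
ISOlat1 names, and the function and table names are hypothetical.

```python
import re

# A few ISOlat1 entity names and their ISO Latin 1 codes (partial list).
ISOLAT1 = {
    "aring": 0xE5, "auml": 0xE4, "ouml": 0xF6, "uuml": 0xFC,
    "oslash": 0xF8, "aelig": 0xE6, "eacute": 0xE9,
}

# Matches &#246; (by number) or &ouml; (by name).
REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def expand(text: str) -> bytes:
    """Turn 7-bit text with references into 8-bit Latin 1 bytes."""
    def repl(match):
        ref = match.group(1)
        code = int(ref[1:]) if ref.startswith("#") else ISOLAT1.get(ref)
        if code is None or code > 0xFF:
            return match.group(0)   # unknown reference: leave untouched
        return chr(code)
    # Assumes the input itself is 7-bit; encode() would fail otherwise.
    return REF.sub(repl, text).encode("latin-1")


print(expand("Bj&ouml;rn"))   # by name
print(expand("Bj&#246;rn"))   # by number
```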

> Here is the crux of the matter:
>

> >The communication between it and the text object would have to be
> >defined in terms of a particular character set
>

> And this character set is stated in the SGML declaration at
> the top of html.dtd.

No - that is something different. In the top of the DTD is specified
the reference base set for the DTD itself and SGML documents.
The interface between two software modules is something else and can
be independent of that.

> If we define HTML in terms of the
> full ISO Latin 1 character set, then the parser can deal with
> &ouml, and pass it to the text object as a data character, just
> like an 'A' character. For X displays using iso8859 fonts, that's
> cool.

Sorry, is iso8859 = ISO Latin 1? (I have no head for numbers >1 :-)

Yes, it is cool. Use Midas or Viola to look at the Hyper-G stuff and
it works very nicely.

> But on a PC or a Mac, that means the text object will have to
> scan all the data it gets and convert the Latin1 encoding to
> its own. Yuck.

Yup. Big deal? Not really. Just a set of parallel tables. Peter
Flynn of the CURIA project is producing a lot of archived Gaelic and
is currently dealing with a requirement for a line-mode browser which
can switch its character set depending on the terminal emulator the
reader is using.
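
The "parallel tables" idea can be sketched as one 256-entry table per
local character set. This is a rough Python illustration, not any
browser's actual code: the codec names "cp437" (PC) and "mac_roman"
(Mac) come from Python's standard library, and the "?" fallback for
unmappable codes is an assumption.

```python
def build_table(local_codec: str, fallback: bytes = b"?") -> bytes:
    """Return a 256-byte table mapping Latin 1 codes to local codes."""
    table = bytearray(256)
    for code in range(256):
        char = bytes([code]).decode("latin-1")
        try:
            local = char.encode(local_codec)
        except UnicodeEncodeError:
            local = fallback            # no equivalent in the local set
        # Keep only strict 1-1 mappings; anything else gets the fallback.
        table[code] = local[0] if len(local) == 1 else fallback[0]
    return bytes(table)


# One table per target; switching character set means switching table.
TO_PC = build_table("cp437")
TO_MAC = build_table("mac_roman")


def to_local(data: bytes, table: bytes) -> bytes:
    """Convert Latin 1 data to the local set with a single table lookup."""
    return data.translate(table)


print(to_local(b"Bj\xf6rn", TO_PC))
print(to_local(b"Bj\xf6rn", TO_MAC))
```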

Problems only occur if there are characters which can't be mapped 1-1
to the local set, and must be represented by more than one character
(like u-umlaut -> ue, ae diphthong -> ae etc.) AND you can edit, in
which case the original form must be preserved. In this case, passing
on the entity is essential. But doing it for every character >127
would be a pain memory-wise. So I would suggest that a configurable
table define which entities can be crunched down to a single
character in the local representation, and the rest be passed on from
the SGML parser to the SGML app as external entities.
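
A minimal Python sketch of that suggestion, under stated assumptions:
the parser's output is modelled as a list of plain strings and
EntityRef markers, and the names EntityRef, crunch and single_char are
invented here for illustration.

```python
import re
from typing import List, Mapping, NamedTuple, Union


class EntityRef(NamedTuple):
    """An entity passed on unchanged to the application."""
    name: str


ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def crunch(text: str,
           single_char: Mapping[str, str]) -> List[Union[str, EntityRef]]:
    """Split text into data runs and entities the local set cannot hold.

    single_char is the configurable table: entity name -> one local
    character. Entities not listed there are passed on as EntityRef.
    """
    out: List[Union[str, EntityRef]] = []

    def emit(piece: str) -> None:
        # Merge adjacent data so the application sees long runs.
        if out and isinstance(out[-1], str):
            out[-1] += piece
        else:
            out.append(piece)

    pos = 0
    for match in ENTITY.finditer(text):
        if match.start() > pos:
            emit(text[pos:match.start()])
        name = match.group(1)
        if name in single_char:
            emit(single_char[name])        # crunched to one character
        else:
            out.append(EntityRef(name))    # original form preserved
        pos = match.end()
    if pos < len(text):
        emit(text[pos:])
    return out


# Hypothetical local table: this terminal has u-umlaut but no ae ligature.
print(crunch("M&uuml;ller &aelig;", {"uuml": "\x81"}))
```

An editor built on this could write the EntityRef items back out
unchanged, so nothing is lost for characters the terminal cannot show.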

> >... and perhaps if there is more than one
> >contender the SGML engine could have a compilation option.
>

> Hmmm... One might argue that as long as we support conversion inside
> the SGML parser for EBCDIC machines, we might as well support PC and
> Mac character sets while we're at it.

Yes.

Tim