Re: Putting the "World" back in WWW...

HALLAM-BAKER Phillip (hallam@dxal18.cern.ch)
Mon, 3 Oct 1994 20:36:56 +0100


In article <8A10@cernvm.cern.ch>, fxrojas@nlsarch.austin.ibm.com ((Frank Rojas ) ) writes:
|
|> UTF-8 - A special byte encoding that makes full 8 bit transfer over 7
|> bit gateways safe. Since HTTP is defined to be 8 bit clean, this is
|> not needed.
|>
|>UTF-8 is based on FSS-UTF (File System Safe UTF ) developed at XOpen that
|>is an 8-bit encoding of UCS-2. My understanding this name has been registered
|>with the ECMA.

We have been looking into this over the past few days. UTF basically overcomes
whinges about UNICODE doubling storage space. It is simply another content
encoding to deal with.
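
To put rough numbers on the storage point, here is a minimal sketch comparing
a fixed 16-bit representation with UTF-8. It uses Python's "utf-16-be" codec
as a stand-in for 2-byte UNICODE, which is an assumption for illustration;
the sample strings and the sizes() helper are made up for this example.

```python
# Compare storage for the same text as fixed 16-bit units and as UTF-8.
# "utf-16-be" stands in for 2-byte UNICODE here (an assumption; the two
# agree for the characters used below).
samples = {
    "ascii": "Hello, World",
    "latin1": "na\u00efve caf\u00e9",
    "japanese": "\u65e5\u672c\u8a9e",
}


def sizes(text):
    """Return (sixteen_bit_size, utf8_size) in bytes for a string."""
    return len(text.encode("utf-16-be")), len(text.encode("utf-8"))


for name, text in samples.items():
    wide, utf8 = sizes(text)
    print(f"{name}: {wide} bytes as 16-bit units, {utf8} bytes as UTF-8")
```

Plain ASCII text stays at one byte per character in UTF-8, which is the
doubling that people complain about. The saving does not hold for every
script: the Japanese sample comes out larger in UTF-8 than in 16-bit units.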

We are now looking at a scheme whereby tokens from the SGML parser are
converted internally into a 32 bit code, mapping this onto UNICODE. Thus
the &wazoo; nonsense can be dealt with by the lexical analyser. Imagine now
a feedback from the SGML parser proper to cause context switches of the
code converter. Using such a scheme it is even possible to change encoding
in mid stream (useful for handling JIS escapes).

A charset module can easily be written to convert fairly arbitrary encodings
into UNICODE tokens. This can also do UTF, ASCII, ISO-8859, JIS, and whacky
Russian etc. encodings.
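
A rough sketch of what such a charset module could look like, using Python's
standard codecs module rather than anything from our own code. The class name
CharsetConverter and its feed()/switch() methods are hypothetical; switch()
plays the part of the context switch triggered by feedback from the parser.

```python
import codecs


class CharsetConverter:
    """Turn bytes in a named encoding into a list of integer code points."""

    def __init__(self, charset="iso-8859-1"):
        self.switch(charset)

    def switch(self, charset):
        # Context switch: bytes fed after this call use the new charset.
        # Undecodable bytes are replaced instead of raising an error.
        self._decoder = codecs.getincrementaldecoder(charset)("replace")

    def feed(self, data):
        # Each decoded character becomes one integer token.
        return [ord(ch) for ch in self._decoder.decode(data)]


conv = CharsetConverter("ascii")
tokens = conv.feed(b"a")

conv.switch("koi8-r")                 # one of the Russian encodings
tokens += conv.feed(b"\xc1")          # Cyrillic small letter a

conv.switch("iso2022_jp")             # JIS with escape sequences
jis_bytes = "\u3042".encode("iso2022_jp")   # sample data built for the demo
tokens += conv.feed(jis_bytes)

print([hex(t) for t in tokens])
```

The incremental decoder keeps its own state between calls, so the JIS escape
sequences are handled inside the codec and the caller only sees code points.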

On the other side I am looking into a scheme of `multifonts' which allows
several X11 fonts to be compounded into a single UNICODE mapping. Because the
display module is directly engaged we can translate into the target font
character by character. This scheme means that the UNICODE stuff does not cause
increased internal storage requirements.

Because ASCII maps into the lower 8 bits of UNICODE anyway there is no speed
penalty for straight ASCII. Other forms are similarly efficient. The only problem
is when fonts map to overlapping parts of the encoding space.
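
The multifont idea can be sketched as a table of code ranges, each pointing
at one font. This is only an illustration under assumptions: the font names,
the ranges and the lookup() helper are all hypothetical, and "first match
wins" is just one possible way to settle overlapping ranges.

```python
from typing import NamedTuple


class FontRange(NamedTuple):
    first: int   # first UNICODE code covered by this font
    last: int    # last code covered (inclusive)
    font: str    # hypothetical font name
    base: int    # glyph index inside the font that corresponds to `first`


# Hypothetical multifont: several fonts compounded into one mapping.
MULTIFONT = [
    FontRange(0x0020, 0x00FF, "latin1-font", 0x20),
    FontRange(0x0400, 0x04FF, "cyrillic-font", 0x00),
    FontRange(0x05D0, 0x05EA, "hebrew-font", 0x00),
]


def lookup(code, table=MULTIFONT):
    """Return (font, glyph_index) for a code, or None if no font covers it.

    The table is searched in order, so where ranges overlap the earlier
    entry wins.
    """
    for r in table:
        if r.first <= code <= r.last:
            return r.font, code - r.first + r.base
    return None


print(lookup(ord("A")))
print(lookup(0x0410))
print(lookup(0x3042))
```

Translation happens one character at a time at display, so nothing wider
than the incoming tokens has to be stored.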

So the content type is

text/html (default to ISO-8859-1)
text/html; charset=UNICODE
text/html; charset=UTF
text/html; charset=ISO..
text/html; charset=JIS
etc.

This is nice because we can then reuse the display module for text/plain.
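
Picking the charset out of such a content type could look roughly like this.
The sketch leans on Python's standard email package to parse the parameter;
the charset_of() helper is hypothetical, and the default simply follows the
list above.

```python
from email.message import Message


def charset_of(content_type, default="iso-8859-1"):
    """Return the charset parameter of a content type, lowercased.

    Falls back to `default` when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset", default).lower()


for value in ("text/html", "text/html; charset=UTF", "text/plain; charset=JIS"):
    print(value, "->", charset_of(value))
```

The same function serves text/html and text/plain, since only the parameter
is inspected and not the media type.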

> The only limitation
> is whether the browser has the font for the language and supports
> non-left to right languages (which a real international browser should).
>
>This is a sticky issue. I.e. we can not expect every client to be localized
>the same as the server...

For X11 this is a real pain because of the way fonts are handled. It is not
very easy to load up application specific fonts. Otherwise we could use URLs
as the transport mechanism.

We are currently using the standard X11 font distribution. MIT have free
fonts for Korean, Chinese and Japanese. There are several Hebrew ones. I have
metafont for hieroglyphs which I would like to have in X11 but the SeeTeX
stuff will not compile on my machine.

Anyone got a URL for UNICODE???

--
Phillip M. Hallam-Baker

Not Speaking for anyone else.