Re: International Document Server Support

Dave_Raggett (dsr@hplb.hpl.hp.com)
Wed, 8 Dec 93 10:28:19 GMT


> Are these character sets all 8-bit sets, or is there support for
> multi-byte characters? I am not familiar with the way things work on X,
> but on a Mac, Chinese and some other languages use 2-byte characters.

These can be multi-byte character sets. See RFC 1468, and also
http://www.ntt.jp/japan/note-on-JP/, which explains how NTT has
patched the WWW library to work with the 7-bit ISO-2022-JP encoding,
which is widely used for email and network news.

Basically, ISO-2022-JP uses escape sequences to switch between
ASCII and JIS X 0208. The latter is a two-byte character set including
Kanji, Hiragana, Katakana and some other symbols and characters.

    ESC ( B    ASCII
    ESC $ B    JIS X 0208-1983
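
As a rough illustration of the switching, here is a small sketch in Python (not part of libwww or the NTT patch). It assumes Python's standard `iso2022_jp` codec; the sample string is made up for the example.

```python
# Sketch: show where the escape sequences fall in ISO-2022-JP output.
# Assumes Python's built-in "iso2022_jp" codec; the sample text is arbitrary.

TO_ASCII = b"\x1b(B"   # ESC ( B  designates ASCII
TO_JIS = b"\x1b$B"     # ESC $ B  designates JIS X 0208-1983

text = "abc\u65e5\u672c"            # "abc" followed by two Kanji
encoded = text.encode("iso2022_jp")

# The ASCII part is sent as-is; the Kanji are bracketed by the two escapes.
print(encoded)

# Every byte stays in the 7-bit range, which is why this survives mail and news.
seven_bit = all(b < 0x80 for b in encoded)
print("7-bit clean:", seven_bit)
```

Each Kanji takes two bytes between the escapes, and the encoder returns to ASCII at the end of the text.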

NTT have patched libwww to recognise markup only in ASCII text. This
is a simple change and only affects SGML.c. This approach allows one
to use the full range of 8-bit character sets, but for portability it
is essential that we stick to the ISO registered escape sequences and
character sets.
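
The idea of recognising markup only in ASCII text can be sketched as follows. This is a hypothetical Python illustration, not the actual change to SGML.c: the function name `markup_offsets` is invented here, and it handles only the two escape sequences listed above (RFC 1468 defines others, such as ESC ( J and ESC $ @).

```python
# Sketch: treat "<" as markup only while ASCII is the designated set.
# Hypothetical helper, not the NTT patch; handles two escape sequences only.

ASCII_SEQ = b"\x1b(B"  # ESC ( B  switch to ASCII
JIS_SEQ = b"\x1b$B"    # ESC $ B  switch to JIS X 0208-1983


def markup_offsets(data: bytes) -> list[int]:
    """Return offsets of '<' bytes seen while in ASCII mode."""
    offsets = []
    in_ascii = True  # ISO-2022-JP text starts out in ASCII
    i = 0
    while i < len(data):
        if data.startswith(ASCII_SEQ, i):
            in_ascii = True
            i += len(ASCII_SEQ)
        elif data.startswith(JIS_SEQ, i):
            in_ascii = False
            i += len(JIS_SEQ)
        elif in_ascii:
            if data[i] == ord("<"):
                offsets.append(i)
            i += 1
        else:
            i += 2  # skip one two-byte character without inspecting it
    return offsets


# A two-byte character whose first byte happens to equal "<" (0x3C).
sample = b"<p>" + JIS_SEQ + b"\x3c\x21" + ASCII_SEQ + b"</p>"
print(markup_offsets(sample))
```

A naive scan for "<" would find three candidates in the sample; tracking the designated character set rejects the one that is really half of a two-byte character.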

If browsers use the Accept-Language: header correctly, then we can
avoid the problem where the browser doesn't have any fonts for the
designated character set. The change needed to Mosaic to support this
scheme isn't bad, and there are already patched versions of:

tkWWW browser/editor for X11
emacs browser
line mode browser
X Mosaic 1.2 for X11 (in alpha version)

Marc Andreessen writes:

> You know, there's not a chance in hell we'll be able to support this
> in the foreseeable future...

I think it is important that we don't take an English-centric view of the
world. The escape sequence mechanism and support for Accept-Language and
multi-byte character sets seem like a good step in the right direction.

Dave Raggett