Re: ISO charsets; Unicode

Stavros Macrakis (macrakis@osf.org)
Fri, 30 Sep 1994 11:50:52 -0400


Judith--

I certainly agree that library-quality transliteration is not trivial,
and in many cases is impossible. As you say, there is no reasonable
way to transliterate Japanese Kanji, and many scripts have multiple
transliterations. And even having a language attribute is in general
insufficient -- as you point out, there are multiple systems for
transliterating many languages (this of course should be a client-side
setting). In any case, as I said in my last note, language tagging is
useful for many things. It is simply not critical in the way encoding
tagging is.

But all of this goes far beyond my original point, which was simply
that browsers that are unable to display, say, Arabic can choose to
display a transliteration instead (NOT a transcription, which is in
general impossible for unvocalized Arabic). If Arabic comes to me in
an HTML document, and I don't have the appropriate fonts, I would much
rather see "'l'skndry2h" than "#########". With the former, I have some
chance of recognizing it as al-Iskandariyyah, i.e. Alexandria.
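As a concrete sketch of that fallback (not from the original note): a
browser could carry a small romanization table for scripts it cannot
render. The table and function below are invented for illustration,
and the mapping is a toy, not any of the standard Arabic
transliteration schemes (ALA-LC, ISO 233, ...).

    # Minimal sketch of a transliteration fallback for a display that
    # lacks Arabic glyphs.  The mapping is a toy romanization table.
    ARABIC_TO_LATIN = {
        "\u0627": "'",   # alif
        "\u0644": "l",   # lam
        "\u0633": "s",   # sin
        "\u0643": "k",   # kaf
        "\u0646": "n",   # nun
        "\u062F": "d",   # dal
        "\u0631": "r",   # ra
        "\u064A": "y",   # ya
        "\u0629": "h",   # ta marbuta
    }

    def render_fallback(text: str, have_arabic_font: bool) -> str:
        """Return text unchanged if we can display it; otherwise
        transliterate what we can and keep a visible placeholder for
        anything unmapped."""
        if have_arabic_font:
            return text
        return "".join(ARABIC_TO_LATIN.get(ch, ch if ch.isascii() else "#")
                       for ch in text)

    # "al-Iskandariyyah" (Alexandria) degrades to a rough romanization
    # instead of a row of '#' signs.
    print(render_fallback(
        "\u0627\u0644\u0627\u0633\u0643\u0646\u062F\u0631\u064A\u0629",
        False))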

An additional related point: The English-Arabic dictionary is one
thing, but how about Armenian-Russian or Arabic-Chinese? One
objection I heard from the Russians and Balts that I have spoken to
is that even the attempts to standardize on expanded character sets
have tended to ignore THESE kinds of mixtures, showing a kind of
Western-Europe fixation that does not solve THEIR problems.

No one is "ignoring" these problems. There are two basic ways to
solve them: (1) define a way of mixing different encoding systems in
one document, and (2) define a universal character set that covers
all necessary characters. Both ways work, although solution (2) is
cleaner and simpler -- but it requires 16 bits per character. I
suspect that (1) and (2) will have to coexist for the foreseeable
future.

Solution type (1) requires conventions for switching codesets in the
middle of a document, as is done by ISO 2022 (?) and by the proposals
being discussed in this thread. This solution is rather clumsy, but
it works.
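For a concrete illustration of what such switching looks like on the
wire (my example, not part of the original note), the ISO-2022-JP
codec in Python's standard library embeds designation escapes directly
in the byte stream; it is only one instance of the general ISO 2022
mechanism.

    # Sketch of solution (1): stateful codeset switching in one byte
    # stream, illustrated with the standard ISO-2022-JP codec.
    text = "Alexandria / アレクサンドリア"
    data = text.encode("iso2022_jp")

    # ESC $ B designates JIS X 0208, ESC ( B switches back to ASCII;
    # every decoder has to track this state while scanning the stream.
    print(data)
    print(data.decode("iso2022_jp") == text)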

Solution type (2) requires defining a universal character set. There
is in fact such a character set, namely Unicode/ISO 10646, and it is
apparently being seriously considered for use in HTML.
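As a minimal sketch (again mine, not from the original message), the
kind of mixed Armenian-Russian or Arabic-Chinese text raised above is
just a flat sequence of code points in a universal character set, with
no switching state; in the 16-bit form referred to here (today's
UCS-2/UTF-16), each of these characters takes two bytes.

    # Sketch of solution (2): one universal code space.  Armenian,
    # Cyrillic, Arabic, and Han characters coexist in a single string;
    # each character is simply a code point.
    mixed = "Հայերեն русский عربي 中文"
    for ch in mixed:
        if not ch.isspace():
            print(f"U+{ord(ch):04X}", ch)

    # Two bytes per character for this text in the 16-bit encoding.
    print(len(mixed.encode("utf-16-be")))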

Unicode covers just about every writing system you can think of,
including Latin, Greek, Cyrillic (with all the different extensions),
Arabic (with all the different extensions), Hebrew, Armenian, Laotian,
Chinese (both ideograms and Bopomofo), Japanese (kanji, hiragana,
katakana), Korean (both ideograms and Hangul), Devanagari (with all
the different extensions), Oriya, Tamil, and many other Indian
scripts, Tibetan, Mongolian, Ethiopian, and many, many others.
Recently proposed extensions include Aramaic, Cherokee, Etruscan,
Glagolitic, Linear B, Ogham, Old Persian Cuneiform, Ugaritic
Cuneiform, Northern Runic, Epigraphic South Arabian, etc. It is true
that some important scripts are missing, mostly because there are
serious open philological questions; these include the Akkadian,
Sumerian, and Babylonian cuneiform systems, Hittite, Linear A, etc.
If you have questions about how Unicode works, etc., I suggest you
take them off-line, because this is peripheral to WWW.