Re: ISO charsets; Unicode

Stavros Macrakis (macrakis@osf.org)
Wed, 28 Sep 1994 11:09:46 -0400


> The LANG attribute is essential for handling text which reads right
> to left rather than left to right....

>Actually, all that is needed is a unique identification of each
>presentation character as right-to-left or left-to-right. If a viewer
>encounters the logical sequence of letters Arabic-J Arabic-M Arabic-L,
>is presenting them using Arabic script, and has an Arabic font to
>present it in, it should display the glyphs...

I have to agree with Dave here, though you are persuasive enough,
I'll admit.

"Persuasive", perhaps, but apparently not "enough"....

The basic point is that various coding schemes overlap. You can't
assume that everyone will jump on the Unicode bandwagon right away.
In some contexts, 8-bit characters will always be with us. So we
are left with only one option - a LANG attribute, plus some other
attribute designating the encoding scheme used.

I was not assuming that everyone will be using Unicode immediately. I
was assuming that any character in the HTML can be unambiguously
identified. This only needs an encoding-system attribute, not a
language attribute. The rendering system does _not_ need to know
whether a given letter "t" in my text is being used to write German,
Chinese in Pinyin, or Maltese. It does not need to know whether a
given letter Arabic-sin is being used to write Arabic, Persian,
Ottoman Turkish, or Greek (yes, I have seen Greek written in Arabic
script), or whether a given letter Hebrew-aleph is being used to write
Hebrew or Yiddish.
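
A minimal sketch of this point, assuming purely for illustration that
the characters are identified as Unicode code points (any unambiguous
encoding would serve). The helper names script_of and is_right_to_left
are hypothetical, and taking the first word of the character name is
only a crude stand-in for a real script property. The script and the
direction both follow from the character itself, with no language tag:

```python
import unicodedata

def script_of(ch):
    # Crude heuristic (an assumption for this sketch): the first word of
    # the Unicode character name, e.g. "ARABIC LETTER SEEN" -> "ARABIC".
    return unicodedata.name(ch).split()[0]

def is_right_to_left(ch):
    # Bidirectional classes "R" and "AL" mark strong right-to-left letters.
    return unicodedata.bidirectional(ch) in ("R", "AL")

# Latin t, Arabic sin, Hebrew aleph: no language is ever consulted.
for ch in ("t", "\u0633", "\u05d0"):
    print(script_of(ch), is_right_to_left(ch))
```

The same letter "t" gives the same answer whether the surrounding text
is German, Pinyin, or Maltese.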

Just in theoretical terms, it's more pleasing to talk about
languages than characters, anyway. After all, I can write a glyph
any way I want. What determines how the glyph relates to other
glyphs is the script system it belongs to.

Part of the problem here is terminological. Unfortunately, there is
no standard terminology. In any case, I think we agree that the
_script system_ of each character needs to be unambiguously
identified--we would not want to encode an Arabic-L in the same way as
a Latin-L.

>If however the viewer cannot display Arabic script, or if the user
>prefers Latin script (perhaps s/he doesn't even read Arabic script,
>but is consulting the etymology of a word that comes from Arabic in a
>dictionary), it may well choose to present it in transliteration as
>"jml" in that order.

This is a really thoughtful point, and frankly it had not occurred
to me before. There are, indeed, international standards for
transliterating Arabic, as for many other languages. Your idea,
though, is not practical because there aren't always one-to-one
correspondences. Take, for example, the classical Hebrew shwa.
How do we do it in English? First of all it is a diacritic.
Secondly it is pronounced differently in different contexts -
sometimes as nothing at all. It's a bit like rendering English
"wine" in a foreign script. Do we transliterate the final -e?
Unfortunately, transliteration requires more than a simple mapping
of one charset to another. Knowledge of the underlying language is
required. So I vote that we stick with a LANG attribute.

Note that I said "transliterate" and not "transcribe". We certainly
don't have enough information from the Arabic text j-m-l to know
whether to transcribe it as jamal or jumil or whatever. However, we
do have the information to transliterate it as jml. I agree that we
can do a better job of transliterating if the language is identified,
since in some cases different transliteration systems are used for
different languages using the same script. But this is a second-order
effect. As I said in my last message, identifying the language _is_
useful, but not essential, whereas identifying the script system is.
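
To make the distinction concrete, here is a minimal sketch of
letter-for-letter transliteration. The table ARABIC_TO_LATIN and the
function transliterate are hypothetical names, the table covers only
the three letters of the example, and it is not taken from any
published transliteration standard:

```python
# Hypothetical letter-for-letter table, for illustration only; it is not
# taken from any published transliteration standard.
ARABIC_TO_LATIN = {
    "\u062c": "j",  # ARABIC LETTER JEEM
    "\u0645": "m",  # ARABIC LETTER MEEM
    "\u0644": "l",  # ARABIC LETTER LAM
}

def transliterate(text):
    # Walk the text in logical (stored) order; characters missing from
    # the table pass through unchanged.
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)

print(transliterate("\u062c\u0645\u0644"))  # prints jml
```

Nothing here knows or needs the vowels, which is exactly why the result
is "jml" and not a transcription such as "jamal". A per-language table
could refine the output, but the lookup itself depends only on the
script.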

If a client runs into Arabic, and can't display Arabic, then it's
out of luck. I don't think that automatic translation into a Latin
font is practical for enough cases to warrant building it into the
clients along with everything else we're proposing.

This is up to the client writers to decide. The HTML spec should
simply provide the necessary information. Let me give you an example
of the usefulness of the functionality I sketched. The 11th edition
of the Encyclopedia Britannica (which has a good chance of being
online soon, for free) includes citations in Greek and Hebrew in the
original script. I am familiar with Greek script, and would like to
see it as Greek. However, I am not familiar with Hebrew script, and
would prefer to see it in transliteration. The OED includes even more
scripts in the original form....
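
A client-side policy of this kind might be sketched as follows. This is
an illustration under stated assumptions, not a proposal for the spec:
PREFERENCES, TABLES, script_of, and render are all hypothetical names,
the Hebrew table covers only four letters, and the script is guessed
from the first word of the Unicode character name:

```python
import unicodedata

# Per-script reader preferences (hypothetical): scripts not listed as
# "transliterate" are shown as-is.
PREFERENCES = {"GREEK": "native", "HEBREW": "transliterate"}

# Hypothetical, deliberately tiny transliteration tables keyed by script.
TABLES = {
    "HEBREW": {
        "\u05e9": "sh",  # HEBREW LETTER SHIN
        "\u05dc": "l",   # HEBREW LETTER LAMED
        "\u05d5": "w",   # HEBREW LETTER VAV
        "\u05dd": "m",   # HEBREW LETTER FINAL MEM
    },
}

def script_of(ch):
    # Crude heuristic (an assumption): first word of the character name.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def render(text):
    out = []
    for ch in text:
        script = script_of(ch)
        if PREFERENCES.get(script) == "transliterate":
            # Letters missing from the table are shown as "?".
            out.append(TABLES.get(script, {}).get(ch, "?"))
        else:
            out.append(ch)
    return "".join(out)

print(render("\u03bb\u03cc\u03b3\u03bf\u03c2"))  # Greek stays Greek
print(render("\u05e9\u05dc\u05d5\u05dd"))        # Hebrew is transliterated
```

The document supplies only the character identities; whether to show
the original script or a transliteration is decided entirely by the
client and its user.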

-s