Re: ISO charsets; Unicode

Jeff Smith (sumisu@slab.ntt.jp)
Tue, 27 Sep 1994 11:30:36 +0900


If you haven't noticed, Motif doesn't allow the mixing of character
sets in a single text widget - it takes more than a hack of the client
to display multiple character sets (e.g. Hebrew, Greek, Japanese) on
the same "page."

The only way to do this - I haven't tried - would be to use Mule.

js

|>In article <8899@cernvm.cern.ch> you write:
|>
|>|>Has a formal mechanism been considered for specifying various popular
|>|>coding standards, such as ISO 8859-7, ISO 8859-8, etc., and (perhaps
|>|>off in the future) Unicode?
|>
|>Yes, it is a parameter to the text/xxx content type:-
|>
|>text/html; charset=ISO8859-7
|>
|>Or some such stuff.
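
A minimal sketch, in Python, of what a client could do with that parameter. The helper names charset_of and decode_body are made up for illustration, and the "us-ascii" fallback is an assumption. The spelling "ISO8859-7" follows the example above; the registered name is hyphenated ("ISO-8859-7"), and Python happens to accept both. The point is that the same byte means a different letter under each declared charset:

```python
from email.message import Message


def charset_of(content_type, default="us-ascii"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() falls back to `default` when no charset is given
    return msg.get_content_charset(default)


def decode_body(raw, content_type):
    """Decode raw bytes using the charset named in the Content-Type value."""
    return raw.decode(charset_of(content_type))


# The single byte 0xE1 under three different declared charsets
for ctype in ("text/html; charset=ISO8859-1",
              "text/html; charset=ISO8859-7",
              "text/html; charset=ISO8859-8"):
    ch = decode_body(b"\xe1", ctype)
    print(ctype, "->", "U+%04X" % ord(ch))
```

This prints U+00E1 (a with acute accent), U+03B1 (Greek alpha) and U+05D1 (Hebrew bet) in turn, which is why a client that ignores the parameter cannot display the text correctly.
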
|>
|>
|>|>The motivation for this question is essentially this: Several really
|>|>exciting developments are being stymied by the Web's largely ASCII/
|>|>English-only focus. As I discussed privately with several readers of
|>|>this forum, there is, for example, a project afoot (nearly complete)
|>|>to create a full lexicon and concordance of the Dead Sea Scrolls. I
|>|>imagine a system where users can look up words, and view the original
|>|>scrolls as inlined images. The problem is that the DSS are written
|>|>in Greek, Aramaic, and Hebrew.
|>
|>This is a Mosaic problem, not a WWW problem. Mosaic can handle multiple
|>fonts but only one charset. At least one TBA browser supports mixed
|>character set documents. HTML/3.0 is better here as well.
|>
|>
|>|> Specially hacked clients are only just
|>|>recently arriving that can do Japanese and a few other languages. No
|>|>general solution exists. And (perhaps most importantly) there is no-
|>|>thing in the HTML(+) descriptions that allows one to specify when text
|>|>in one language ends and text in another begins, or to specify what
|>|>encoding system is being used for either. The few hacked clients I've
|>|>seen also are not really geared for display of arbitrary languages.
|>
|>Hacked versions for any particular language are easy to come by. There
|>is no browser that can display English, Greek and Hebrew together at
|>present. This will change. At some point the difficult question of
|>mixing left/right scanning languages will have to be tackled.
|>
|>
|>|>The DSS project isn't the only one that appears stymied. There is a
|>|>Cushitic etymological database (say that with a mouth full) at the U
|>|>of Chicago that's machine readable, and comes replete with a standard
|>|>interface. The project head would be happy to plug it into the Web,
|>|>but again the Web only knows ASCII.
|>
|>Here I suspect you need something quite a bit more sophisticated, which
|>is at least 6 months off. You need a highly modular browser that lets you
|>drop in your own module. That type of research tends to need highly
|>specialised fonts and a lot more flexibility than first sight might imply.
|>
|>
|>|>Other projects afoot are a comprehensive Aramaic dictionary. Aramaic
|>|>is the language of parts of the biblical book of Daniel and Ezra, and
|>|>a stray verse in Jeremiah. There is a huge corpus of early Christian
|>|>literature written in it, as well as several fundamental Jewish docu-
|>|>ments like the Talmud.
|>
|>Again for any ancient language I suspect you will need multiple character
|>sets for different periods, different script styles etc. RFC-822 is pretty
|>much the same in gothic or helvetica. But if you are discussing an ancient
|>text, typeface questions can be very critical. This is especially so with
|>cuneiform or hieroglyphic texts.
|>
|>
|>|>Then, of course, there's the giant database project called ARTFL, which
|>|>essentially attempts to make the entire French literary corpus availa-
|>|>ble online. It's already here, and tied to the Web. But they have no
|>|>standard specs for how to allow users to input things as simple as an
|>|>acute accent over an "a". They have an extremely competent staff to
|>|>work on such problems - but I wonder: Should this _be_ a problem?
|>
|>If the browser is a good one it should understand the accents in standard
|>ISO code as well as entities. The entities are pretty much redundant
|>for á, ö etc unless you have a deranged transport that is not 8 bit clean.
|>
|>|> -> Richard L. Goerwitz
|>|> -> goer@mithra-orinst.uchicago.edu
|>
|>Well, since a large number of developers here have funny accents in their
|>names, as I suspect your forebears would have done (Görwitz), extended Latin
|>is pretty much catered for. Greek is essential for the maths and so will
|>go in. Hebrew characters will probably arrive before mixing right/left
|>scanning.
|>
|>If the 7bit mail transport strips off the accents then you might not
|>understand some of the above bits...
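
A transport that is not 8-bit clean typically clears the top bit of every byte. A minimal sketch of that effect in Python, assuming the text was sent as ISO 8859-1 (the function name strip_high_bit is made up for illustration): o-umlaut (0xF6) comes out as "v" (0x76) and a-acute (0xE1) as "a" (0x61), so a name like Görwitz would arrive as Gvrwitz.

```python
def strip_high_bit(text, charset="latin-1"):
    """Simulate a 7-bit transport: encode, clear bit 7 of each byte, read as ASCII."""
    raw = text.encode(charset)
    # Every byte is now in the range 0x00-0x7F, so decoding as ASCII cannot fail
    return bytes(b & 0x7F for b in raw).decode("ascii")


# "G\u00f6rwitz" is Goerwitz spelled with an o-umlaut
print(strip_high_bit("G\u00f6rwitz"))
```

Plain ASCII text passes through unchanged, which is why entities remain the safe spelling for accented letters over such a link.
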
|>
|>--
|>Phillip M. Hallam-Baker
|>
|>Not Speaking for anyone else.