Re: ISO charsets; Unicode

HALLAM-BAKER Phillip (hallam@dxal18.cern.ch)
Mon, 26 Sep 1994 19:01:19 +0100


In article <8899@cernvm.cern.ch> you write:

|>Has a formal mechanism been considered for specifying various popular
|>coding standards, such as ISO 8859-7, ISO 8859-8, etc., and (perhaps
|>off in the future) Unicode?

Yes, it is a parameter to the text/xxx content type:-

text/html; charset=ISO8859-7

Or some such stuff.
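
As a rough illustration of how a client might act on that parameter, here is a minimal sketch using Python's standard library. The helper names charset_of and decode_body are made up for this example, and the "us-ascii" fallback is an assumption rather than something stated above.

```python
from email.message import Message


def charset_of(content_type, default="us-ascii"):
    """Return the charset parameter of a Content-Type value, lowercased.

    Falls back to `default` (an assumption for this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_body(raw, content_type):
    """Decode raw body bytes using the charset named in the header."""
    return raw.decode(charset_of(content_type))


label = "text/html; charset=ISO8859-7"
print(charset_of(label))
# 0xE1 0xE2 0xE3 are the first three lowercase Greek letters in ISO 8859-7;
# ascii() keeps the output printable on any terminal.
print(ascii(decode_body(b"\xe1\xe2\xe3", label)))
```

The point is only that the label travels with the document, so the receiver never has to guess the encoding.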

|>The motivation for this question is essentially this: Several really
|>exciting developments are being stymied by the Web's largely ASCII/
|>English-only focus. As I discussed privately with several readers of
|>this forum, there is, for example, a project afoot (nearly complete)
|>to create a full lexicon and concordance of the Dead Sea Scrolls. I
|>imagine a system where users can look up words, and view the original
|>scrolls as inlined images. The problem is that the DSS are written
|>in Greek, Aramaic, and Hebrew.

This is a Mosaic problem, not a WWW problem. Mosaic can handle multiple
fonts but only one charset. At least one TBA browser supports mixed
character set documents. HTML/3.0 is better here as well.

|> Specially hacked clients are only just
|>recently arriving that can do Japanese and a few other languages. No
|>general solution exists. And (perhaps most importantly) there is no-
|>thing in the HTML(+) descriptions that allows one to specify when text
|>in one language ends and text in another begins, or to specify what
|>encoding system is being used for either. The few hacked clients I've
|>seen also are not really geared for display of arbitrary languages.

Hacked versions for any particular language are easy to come by. There
is no browser that can display English, Greek and Hebrew together at
present. This will change. At some point the difficult question of
mixing left/right scanning languages will have to be tackled.
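
To make the left/right mixing problem a little more concrete: before a renderer can lay out mixed text it has to split it into directional runs. The following is only a sketch under assumptions not in the message above. It assumes Unicode text and Python's unicodedata module, the names strong_direction and direction_runs are invented here, and it is far simpler than a real bidirectional layout algorithm.

```python
import unicodedata


def strong_direction(ch):
    """Return 'rtl' or 'ltr' for strongly directional characters, else None."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"
    if bidi == "L":
        return "ltr"
    return None  # spaces, digits, punctuation: no strong direction


def direction_runs(text, base="ltr"):
    """Split text into (direction, substring) runs.

    Neutral characters simply inherit the direction of the preceding
    run, which is a simplification made for this sketch.
    """
    runs = []
    current = base
    for ch in text:
        current = strong_direction(ch) or current
        if runs and runs[-1][0] == current:
            runs[-1][1] += ch
        else:
            runs.append([current, ch])
    return [(direction, run) for direction, run in runs]


# English, then a Hebrew word (written with escapes), then English again.
sample = "abc \u05e9\u05dc\u05d5\u05dd xyz"
for direction, run in direction_runs(sample):
    print(direction, ascii(run))
```

Even this toy version has to make a policy decision about where the space after the Hebrew word belongs, which hints at why the general case is hard.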

|>The DSS project isn't the only one that appears stymied. There is a
|>Cushitic etymological database (say that with a mouth full) at the U
|>of Chicago that's machine readable, and comes replete with a standard
|>interface. The project head would be happy to plug it into the Web,
|>but again the Web only knows ASCII.

Here I suspect you need something quite a bit more sophisticated, which
is at least 6 months off. You need a highly modular browser into which you
can drop your own module. That type of research tends to need highly specialised
fonts and a lot more flexibility than first sight might imply.

|>Other projects afoot are a comprehensive Aramaic dictionary. Aramaic
|>is the language of parts of the biblical book of Daniel and Ezra, and
|>a stray verse in Jeremiah. There is a huge corpus of early Christian
|>literature written in it, as well as several fundamental Jewish docu-
|>ments like the Talmud.

Again, for any ancient language I suspect you will need multiple character
sets for different periods, different script styles etc. RFC-822 is pretty
much the same in gothic or helvetica. But if you are discussing an ancient
text, typeface questions can be very critical. This is especially so with
cuneiform or hieroglyphic texts.

|>Then, of course, there's the giant database project called ARTFL, which
|>essentially attempts to make the entire French literary corpus availa-
|>ble online. It's already here, and tied to the Web. But they have no
|>standard specs for how to allow users to input things as simple as an
|>acute accent over an "a". They have an extremely competent staff to
|>work on such problems - but I wonder: Should this _be_ a problem?

If the browser is a good one it should understand the accents in standard
ISO code as well as entities. The entities are pretty much redundant
for á, ö etc. unless you have a deranged transport that is not 8 bit clean.
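
A small sketch of what a transport that is not 8 bit clean does to accented letters, and why the entity form survives it. This assumes ISO 8859-1 (Latin-1) on the wire and a transport that clears the top bit of every byte; strip_high_bit is a hypothetical helper, and html.unescape stands in for the browser's entity handling.

```python
import html


def strip_high_bit(data):
    """Simulate a 7-bit transport by clearing bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)


# a-acute and o-umlaut, written with escapes so the source stays ASCII.
original = "\u00e1, \u00f6"

# Sent as raw 8-bit Latin-1 bytes, the accents are lost in transit.
mangled = strip_high_bit(original.encode("latin-1")).decode("ascii")
print(mangled)

# Sent as entities, the text is pure ASCII and passes through untouched.
entity_form = "&aacute;, &ouml;"
survived = strip_high_bit(entity_form.encode("ascii")).decode("ascii")
print(ascii(html.unescape(survived)))
```

On a clean 8-bit path both forms decode to the same characters, which is why the entities are mostly redundant there.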

|> -> Richard L. Goerwitz
|> -> goer@mithra-orinst.uchicago.edu

Well, since a large number of developers here have funny accents in their
names, as I suspect your forebears would have done (Görwitz), extended Latin
is pretty much catered for. Greek is essential for the maths and so will
go in. Hebrew characters will probably arrive before mixing right/left
scanning.

If the 7bit mail transport strips off the accents then you might not
understand some of the above bits...

--
Phillip M. Hallam-Baker

Not Speaking for anyone else.