Re: Different character sets in one HTML document

Pieter van Zee (piet@hpcvusm.cv.hp.com)
Thu, 23 Jun 1994 13:09:52 -0700


I've included an excerpt from a private e-mail conversation
that relates to the question of charset encoding.
------------------------------------------------------------

> ... queried why one needed to specify the charset on each element.
> Wouldn't it be sufficient to specify the ISO 2022 mechanism at the
> MIME level and leave it to the escape mechanism to specify shifts
> between character sets?
>

I'll restate for clarity:
My objective is to support multi-lingual content, i.e. to move
away from the assumption that the entire content of an HTML file
is in a single charset. Such documents are quite useful, e.g.
cross-language dictionaries, newspapers, academic publications,
etc. The proposal of putting charset on each element is one way
to do this. Because there are many charsets that are appropriate
for any given lang value, the charset is one way to uniquely
identify the encoding.

Assuming we agree that HTML documents need to support
multi-lingual content, let's discuss how this might occur. I ran
the following by our i18n guru to verify my comments.

The phrase "specifying the ISO 2022 mechanism at the MIME level"
isn't exactly clear to me. I'll take it to mean that whenever an
HTML document is encapsulated as a MIME object for transport, the
document must use ISO 2022 encoding for its content.

Let's generalize and call this:

Strategy (a): an HTML document has only ISO 2022-encoded content.

And my proposal is:

Strategy (b): every HTML element has optional LANG and CHARSET
attributes which specify the locale of the element's data.

In other words, an HTML document uses 7-bit ASCII for
markup but may use any charset for content, and charset is
specified in two ways: (i) an optional default charset for the
document, and (ii) an optional charset attribute on every
element that overrides the document default.
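To make the override rule concrete, here is a minimal sketch in
Python. The Element class, the effective_charset function, and the
US-ASCII fallback are all hypothetical names and choices made for
illustration; none of them comes from an HTML specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical parsed element; only the fields needed here."""
    name: str
    charset: Optional[str] = None  # value of the CHARSET attribute, if any


def effective_charset(element: Element,
                      document_default: Optional[str] = None) -> str:
    """Return the charset that applies to the element's content."""
    if element.charset is not None:
        # (ii) an element-level CHARSET overrides the document default
        return element.charset
    if document_default is not None:
        # (i) otherwise the optional document default applies
        return document_default
    # Assumption: with neither present, fall back to the markup charset.
    return "US-ASCII"
```

How nested elements should inherit a charset is not addressed here;
that would need to be settled by the proposal itself.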

What are the relative merits and pitfalls?

The short answer is that we can achieve the same end result with
either strategy: (a) the LANG attribute plus ISO-2022 encoding of
the content, or (b) LANG and CHARSET attributes on elements, with
the content in that charset.

The longer answer is that the difference in effort for someone
coding up and maintaining a parser, viewer, or translator is
substantial. The effort differential arises because the ISO-2022
approach isn't well suited to leverage the existing operating
system infrastructure to support development. I guess I'll
contend that we want to avoid making it hard for developers if
reasonable alternatives exist.

Basically, with strategy (a), every program must know how to
parse an ISO-2022 byte stream and map that to something meaningful
on its platform. This means developing, program by program,
lots of tables and code to parse the byte stream and apply the
tables appropriately, for example to use the X11R4 mechanisms
directly to load fonts. To support any new encodings or font
sets, the tables and/or the program must be revised. Note that
this approach is also problematic for PC-based clients.
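As a rough illustration of the per-program work this implies, here
is a sketch of the first step only: scanning a byte stream for
escape sequences and splitting it into runs per charset. The
function name split_segments is hypothetical, and the table covers
just the four designations used by ISO-2022-JP; a real ISO-2022
parser would need a far larger table plus shift handling.

```python
ESC = 0x1B

# Partial designation table (the ISO-2022-JP subset only).
# Supporting a new charset means adding an entry here.
DESIGNATIONS = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_segments(data: bytes, initial: str = "ASCII"):
    """Split a byte stream into (charset, raw_bytes) runs."""
    segments = []
    current = initial
    start = 0
    i = 0
    while i < len(data):
        if data[i] == ESC:
            seq = data[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError(f"unknown escape sequence {seq!r}")
            if i > start:
                # Close the run that ended at this escape sequence.
                segments.append((current, data[start:i]))
            current = DESIGNATIONS[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        segments.append((current, data[start:]))
    return segments
```

Each run would still have to be mapped to a font or a platform
encoding by further tables, which is the part that must be redone
in every program.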

Also, although the ISO-2022 mechanism supports baseline charset
specifications, it does not support higher-level specifications
that combine two or more baseline charsets. These aggregate
charsets, such as Japanese SJIS and EUC, are the charsets that
users are exposed to and which have OS infrastructure support.

With strategy (b), several advantages accrue. Because the LANG
and CHARSET attributes can be combined to create a string
suitable for setlocale(3C), a developer can then leverage all the
infrastructure code inside the multibyte(3C) and X11R5 library
code that knows how to parse byte strings in a given locale and
work with font sets (families) to allow an aggregate charset to
be rendered in full. Also, because ISO-2022 parsing and
translation doesn't have to occur, there is a performance gain.
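A minimal sketch of that combination, using Python's locale module
as a stand-in for setlocale(3C). The helper names and the simple
"lang.charset" concatenation are assumptions for illustration;
actual locale names are platform-specific, so a real program would
need a mapping from attribute values to the names its OS accepts.

```python
import locale


def locale_name(lang: str, charset: str) -> str:
    """Combine LANG and CHARSET values into a locale string.

    Assumes the common language[_territory].codeset naming form.
    """
    return f"{lang}.{charset}"


def try_set_locale(lang: str, charset: str) -> bool:
    """Ask the OS to switch locale; report whether it is supported."""
    try:
        locale.setlocale(locale.LC_ALL, locale_name(lang, charset))
    except locale.Error:
        # The platform has no such locale installed.
        return False
    return True
```

For example, locale_name("ja_JP", "eucJP") yields "ja_JP.eucJP";
whether try_set_locale then succeeds depends on which locales are
installed on the machine.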

In addition, other locale-specific OS capabilities can be
accessed, such as for collating strings, and monetary, number,
and time formats. Further, because these capabilities are in the
OS and not the program, the program automatically benefits from
infrastructure revisions and new capabilities. This seems
especially useful given the evolving standards on both PC and
workstation platforms.

Finally, it seems to me that strategy (b) is a superset of
strategy (a). Using strategy (b), for example, one could specify
that the default charset of the document is ISO-2022 and achieve
strategy (a) with no further effort, while strategy (a) does not
accommodate strategy (b) at all. Using HTTP format negotiation and
an appropriately equipped server, one could imagine servers that
translate documents from one encoding to another (as well as
possible) according to the capabilities of the viewer.
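The translation step on such a server could look roughly like the
sketch below. The function name transcode is hypothetical, and the
codec names are the ones Python happens to ship; characters the
target encoding cannot represent are replaced with "?", which is
one reading of "as well as possible".

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode content from one charset to another.

    Characters with no representation in the target charset are
    replaced rather than causing a failure.
    """
    text = data.decode(source)
    return text.encode(target, errors="replace")
```

With this, EUC-encoded Japanese text can be served as SJIS to a
viewer that only handles SJIS, while a viewer limited to ASCII
would receive "?" in place of each Japanese character.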

Piet van Zee
piet@cv.hp.com