Re: ISO charsets; Unicode

Richard L. Goerwitz (goer@midway.uchicago.edu)
Thu, 29 Sep 94 09:24:08 CDT

Next message: Steven D. Majewski: "Re: Languages (was Re: Forms support in clients)"
Previous message: Nathaniel Borenstein: "Re: Languages (was Re: Forms support in clients)"
Maybe in reply to: Richard L. Goerwitz: "ISO charsets; Unicode"

Phil provides the following remarkably concise summary of bidirectional
wordwrap methods. His overall point is that a language and encoding at-
tribute are fine, but a direction attribute is useless, if not harmful.
He's convinced me.

My only worry is that most client implementors won't understand his formu-
lation. What I'm going to do, therefore, is quote his, then add one of
my own that essentially says the same thing but in different words. What
I'm essentially providing here is a full tutorial.

Save this if you are implementing clients or WYSIWYG HTML editors.

First Phil:

>The rules for [bidirectional left-right right-left] formatting are
>quite easy to work out:
>
>1) The margins of the paragraph are set by the language environment that
> the paragraph started in. Ie range left, range right, center, justify.
>
> This also sets the main scan order, the starting point for the
> typesetting after a new line.
>
>2) Typeset each word working out if it will fit, if there is not enough
> space start a new line.
>
>3) If the scan order changes then the space remaining on the line is
> calculated and typesetting continues as normal, except that we do
> not finalise placing until either the line ends or we have
> another scanning order reverse.
>
> When this happens we fill in the offsets on the buffered
> segments so that the end of the text in the previous scan
> order adjoins the previous text.
>
>so if we have an imaginary text where e is left right scanning and h is right
>left :-
>
>eeeeeeee_1 eeeeeeeee_2 eeeeeee_3 hhhhhh_2 hhhhhhh_1 eee_1 hhhhh_1
>hhhhh_3 hhhh_2 eeeeee_1 eee_2
>
>
>Because we started left to right the second block of hhhh continues at
>the left margin.
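
Phil's rules can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: words arrive in typing (logical) order, each tagged "L" or "R"; every character is one column wide; and the function names are made up for the example. It computes the finished line rather than buffering segments incrementally as Phil describes, and it ignores margin alignment, but the result should be the same.

```python
from itertools import groupby

def wrap_logical(words, width):
    """Greedy wrap of (text, direction) words taken in typing order."""
    lines, current, used = [], [], 0
    for text, direction in words:
        needed = len(text) if not current else used + 1 + len(text)
        if current and needed > width:
            lines.append(current)          # line is full: start a new one
            current, needed = [], len(text)
        current.append((text, direction))
        used = needed
    if current:
        lines.append(current)
    return lines

def visual_line(line, base="L"):
    """Order the words of one wrapped line as they appear left to right."""
    out = []
    for direction, run in groupby(line, key=lambda w: w[1]):
        texts = [text for text, _ in run]
        if direction != base:
            texts.reverse()                # opposite-direction run is flipped
        out.extend(texts)
    if base == "R":
        out.reverse()                      # R-L paragraphs start at the right
    return " ".join(out)

words = [("e1", "L"), ("e2", "L"), ("h1", "R"), ("h2", "R"), ("e3", "L"),
         ("h3", "R"), ("h4", "R"), ("h5", "R"), ("e4", "L")]
for line in wrap_logical(words, width=17):
    print(visual_line(line))
# prints:
# e1 e2 h2 h1 e3 h3
# h5 h4 e4
```

As in Phil's diagram, the h-run that is split by the line break continues at the left margin of the second line, because the paragraph started left to right.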

Now me. First let me begin with an interactive description of how
Arabic would be typed into an English paragraph:

Let's begin on a very simple level. Let's say I want to na etouq
esarhp cibarA (i.e. "quote an Arabic phrase" written right-left).
Note how "quote an" ends up on the top line, while "Arabic phrase"
ends up on the second. If I were entering this via an interactive
system, what would happen is: As I added text, the Arabic letters
would appear right where the "n" is in "an" (i.e. "na"). The cursor
would not move. Rather, the Arabic letters would be "pushed" to-
ward the right-hand margin. When the margin was reached, the cursor
would hop down onto the next line at the far left, and begin pushing
letters toward the margin there. I.e., the cursor stays in place,
and text moves right until it hits the margin, at which point the
cursor hops to the far left of the next line and once again resumes
pushing letters rightward, etc.

This is a fairly simple example, so let me quote a response to some-
one who asked for a kind of tutorial a few months ago. His questions
are marked as "> ". I am not quoting Phil, that is. The discussion
begins with a question regarding how cursor movement works when mov-
ing across a R-L region of text:

> I assume that cursor keys "<-" and "->" still work visually, that is
> assuming R-L text "54*321" and cursor marked with *, cursor key '->'
> (right) results "543*21". Correct?

Correct!

> Assuming situation "543*21", will action "delete-previous-character"
> result "543*1" and "delete-next-character" "54*21"?

The behavior of delete-next-character or delete-previous-character
will depend on the inherent directionality of the language being used
at that particular moment. You are quite correct about R-L languages.
The trouble is what happens when you are at a border between R-L and
L-R text. On this, see below.

> Assuming "abcd*e54321abcde", when "abcde" is L-R and "54321" is R-L
> text, then what is cursor position after application of 3
> cursor-right (->) actions and what are the intermediate states? Just
> simply go visually?
>
> abcde*54321abcde
> abcde5*4321abcde
> abcde54*321abcde

This is correct. Another method is to create an invisible border
between the L-R and the R-L text that does not appear on-screen.
It would be noticed only when doing what you have done above, i.e.,
to move from a L-R region to a R-L region (or vice-versa). When the
cursor reaches this border, and a cursor-right (->) action occurs,
the cursor should change color only, and not move (signifying that
it has crossed the invisible border). After this point, it moves
normally, with the newly acquired color. Another method is to change
cursor shape. Still another method (which I've seen implemented)
is to move the cursor just a pixel or two. The user notices less
movement than he or she is expecting, and this quickly becomes
internalized as a region change.

The reason for all this is simple. If you are between L-R and R-L text,
and you execute a delete-previous-character command, it is impossible
for the user to know which character will go unless he or she has some
visual cue as to which region the cursor is in! Now I can really
answer the Q above about deleting characters. It depends on which color
or shape or whatever the cursor is, i.e., on which region it's in. If
in a R-L region, the visual directionality of the delete-next and
delete-previous commands is reversed.
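
To make the border problem concrete, here is a small sketch. The assumptions are mine: the buffer is a list of (char, direction) pairs in typing order, the cursor is an index into that list, the paragraph is L-R, and the function names are invented for the example. Two different cursor positions render at the same spot on screen, yet delete-previous removes a different character in each case.

```python
from itertools import groupby

def visual_order(buf):
    """Logical indices in left-to-right screen order (L-R paragraph)."""
    order = []
    for direction, run in groupby(range(len(buf)), key=lambda i: buf[i][1]):
        idx = list(run)
        order.extend(reversed(idx) if direction == "R" else idx)
    return order

def render(buf, cursor):
    """Draw the buffer with '*' where the logical cursor appears."""
    if not buf:
        return "*"
    order = visual_order(buf)
    chars = [buf[i][0] for i in order]
    # The cursor sticks to the character typed just before it, if any.
    anchor, after = (cursor - 1, True) if cursor > 0 else (0, False)
    v = order.index(anchor)
    is_ltr = buf[anchor][1] == "L"
    pos = v + 1 if is_ltr == after else v
    return "".join(chars[:pos]) + "*" + "".join(chars[pos:])

def delete_previous(buf, cursor):
    """Remove the character typed just before the cursor."""
    if cursor == 0:
        return buf, cursor
    return buf[:cursor - 1] + buf[cursor:], cursor - 1

def delete_next(buf, cursor):
    """Remove the character typed just after the cursor."""
    return buf[:cursor] + buf[cursor + 1:], cursor

def tag(text, direction):
    return [(ch, direction) for ch in text]

arabic = tag("12345", "R")
print(render(arabic, 2))                       # 543*21
print(render(*delete_previous(arabic, 2)))     # 543*1
print(render(*delete_next(arabic, 2)))         # 54*21

mixed = tag("abcde", "L") + tag("12345", "R") + tag("abcde", "L")
print(render(mixed, 5), render(mixed, 10))     # same picture twice
```

In the mixed buffer, cursor 5 (end of the English) and cursor 10 (end of the Arabic) both display as abcde*54321abcde. Delete-previous removes the "e" in the first case and the "5" in the second, which is why the user needs a color, shape, or position cue.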

> Assuming "abc*de54321abcde" and user starts selecting a text block
> with mouse from the marked position and drawing the mouse
> horizontally to the right over the R-L text section, what is the
> actually selected text in each intermediate state as the selection
> end point slides over the text? What is selected at point
> "abc[de54]321abcde", where '[' marks the selection start point and
> ']' marks the current end point (where mouse pointer is now)?

Okay, this raises some really fun and interesting questions. There
are several possible ways of doing this. One is purely visual.
That is, whatever you select is what you get. Internally, however, it
is probable that the 1 above will be near the "e." That is, the
beginning of the (say, Arabic) word "54321" is the "1", and this will
stand next to the end of the English word "abcde" **internally**. So as
soon as the cursor hits and crosses the invisible border, you are
essentially selecting the entire chunk of text from a to 1 (visually)
and from a to 5 (internally). If the pointing device (say the mouse)
is moved to the spot between the 4 and the 3 (abc*de54*321abcde), then
the defined region would be abc[de]54[321]abcde. This might sound
like science fiction, but it's quite intuitive for anyone who's typed
in bilingual text. Note that if I start with the pointing device
between the 4 and the 5 (abcde5*4321abcde), and move it leftward, I
will end up selecting text as expected (visually) until the invisible
border is reached separating English from Arabic text. At that point,
the selection within the Arabic region would invert. Let's say that I stopped
selecting text between the c and the d. The defined region would end
up being abc[de]5[4321]abcde - the same as if I had moved in the
opposite direction, i.e., from between the c and d to between the 4
and the 5.

Again, it's really quite intuitive for people who write bilingual text.
The Arabic, remember, "begins" at the "1" (internally, that is, and not
visually). As a result if I decide to select characters from the
"e" to the "1", I am in reality selecting the end of one English word
and the beginning of an Arabic word.
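
A sketch of this selection model, again under my own assumptions (an L-R paragraph, a buffer of (char, direction) pairs in typing order, selection endpoints given as internal positions, and invented function names). The selection is one contiguous range internally; on screen it can come out as two separate highlighted pieces.

```python
from itertools import groupby

def visual_order(buf):
    """Logical indices in left-to-right screen order (L-R paragraph)."""
    order = []
    for direction, run in groupby(range(len(buf)), key=lambda i: buf[i][1]):
        idx = list(run)
        order.extend(reversed(idx) if direction == "R" else idx)
    return order

def mark_selection(buf, start, end):
    """Bracket the on-screen pieces of the internal range start..end."""
    lo, hi = sorted((start, end))       # drag direction does not matter
    out, inside = [], False
    for i in visual_order(buf):
        selected = lo <= i < hi
        if selected != inside:
            out.append("[" if selected else "]")
            inside = selected
        out.append(buf[i][0])
    if inside:
        out.append("]")
    return "".join(out)

def tag(text, direction):
    return [(ch, direction) for ch in text]

buf = tag("abcde", "L") + tag("12345", "R") + tag("abcde", "L")
print(mark_selection(buf, 3, 8))   # abc[de]54[321]abcde
print(mark_selection(buf, 3, 9))   # abc[de]5[4321]abcde
print(mark_selection(buf, 9, 3))   # abc[de]5[4321]abcde
```

Position 3 is the gap between "c" and "d"; position 8 is the gap between the Arabic "3" and "4"; position 9 is the gap between "4" and "5". Dragging in either direction between the same two gaps gives the same region.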

Now let's get into the question of input order and wordwrap - which will
help clarify the difference between internal byte-stream order and ex-
ternal presentations.

Note first that the Arabic may extend over more than one line. Let's
use our pseudo-Arabic "54321" example once again, but place this string
close enough to the end of the line that it will have to be wrapped:

1: abcdeabcdeabcde21
2: 543abcdeabcdeabcd
3: abcde etc.

Note how the wrapping algorithm works. I finish typing English at the
last "e" on line one above. When I switch to Arabic, the cursor
stands in place, and the letters all move toward the right margin:

   abcdeabcdeabcde*1
-> abcdeabcdeabcde*21

When the right margin is reached, the cursor hops to the beginning of
the next line, and the letters continue pushing to the right margin:

-> abcdeabcdeabcde21
   *3

-> abcdeabcdeabcde21
   *43

-> abcdeabcdeabcde21
   *543

Now I begin typing English again:

-> abcdeabcdeabcde21
   543a*

One critical point: The wordwrap here assumes basic left-right
directionality. I.e. it assumes English or some L-R language as
setting the primary wordwrap method. If one selects **R-L** as the
primary direction, then the cursor begins on the right-hand side of
the line, and moves leftward for Arabic text. If one *then* types in
English, the cursor stands in place, pushing text leftward until it
hits the margin, then it wraps, placing excess characters at the right
side of the next line. The cursor then remains at the right side of
the next line, and pushes characters to the left. When the user
switches back to Arabic, the cursor leaps over the English text, i.e.
leftwards over the letters that have been pushed towards the left
margin, and the cursor resumes "normal" operation, advancing to the
left after each character is typed!

           *1   (normal Arabic cursor movement; moves leftward after
->        *21    every char. is typed)
->       *321
->      *4321
->     *54321
->    a*54321   (language switch; L-R language begins; cursor stays
->   ab*54321    put; characters are pushed leftwards)
->  abc*54321

->   abc54321   (margin reached; line wrap occurs; cursor jumps to the
            d*   beginning of the next line; resumes leftward "push")

->   abc54321
           de*

->   abc54321   (Arabic text resumes, cursor jumps to the end of Eng.)
         *1de

->   abc54321
        *21de

etc.
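
Both wrapping walk-throughs (L-R primary and R-L primary) can be reproduced with one small routine. This is a sketch under my own simplifying assumptions: wrapping is per character at a fixed width, as in the diagrams, rather than per word; every character is one column wide; and render_lines and tag are names invented for the example.

```python
from itertools import groupby

def render_lines(stream, width, base="L"):
    """Turn (char, direction) pairs in typing order into display lines."""
    lines = []
    for start in range(0, len(stream), width):
        chunk = stream[start:start + width]    # what fits on this line
        out = []
        for direction, run in groupby(chunk, key=lambda c: c[1]):
            chars = [ch for ch, _ in run]
            if direction != base:
                chars.reverse()                # run against the base is flipped
            out.extend(chars)
        if base == "R":
            out.reverse()                      # R-L lines fill from the right
        text = "".join(out)
        lines.append(text.rjust(width) if base == "R" else text)
    return lines

def tag(text, direction):
    return [(ch, direction) for ch in text]

# English paragraph with an Arabic phrase that straddles the line break.
ltr = tag("abcdeabcdeabcde", "L") + tag("12345", "R") + tag("abcdeabcdeabcd", "L")
print(render_lines(ltr, 17))            # ['abcdeabcdeabcde21', '543abcdeabcdeabcd']

# Arabic paragraph with an English word that straddles the line break.
rtl = tag("12345", "R") + tag("abcde", "L") + tag("12", "R")
print(render_lines(rtl, 8, base="R"))   # ['abc54321', '    21de']
```

The only inputs are the two parameters discussed here: the primary direction of the paragraph (base) and the direction of each character as typed. No explicit direction markup is involved.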

What I'm trying to emphasize is that there are two parameters that
need to be set at every instance. These are 1) the basic
directionality of the text, and 2) the directionality of the
particular language being typed. Phil already said this above, and
he's right on the money. I think we can get everything we want here
by generalizing clients' notions of wordwrap, and by adding a lang-
uage and encoding attribute to the HTML standard.

I don't like the old TEI method of identifying language with encod-
ing (e.g. language="ISO 8859-1"). It's not flexible enough. But I
love the idea of separate language and encoding attributes, and I
love Phil's notion that directionality should always be "natural",
and we should not bother with an explicit direction attribute.

Anyone actually read this far? I have lots of other things to do,
but I felt it was important to set aside a significant chunk of
my time in order to do a good job of explaining.

-Richard L. Goerwitz goer%midway@uchicago.bitnet
goer@midway.uchicago.edu rutgers!oddjob!ellis!goer
