Re: Client-side highlighting; tag proposal

Steven J. DeRose (sjd@ebt.com)
Tue, 14 Mar 1995 15:49:45 +0500


At 1:35 AM 3/14/95 +0500, Joe English wrote:
>This is not much of an issue for HTML documents on the Web,
>since they tend to be small and are rendered as a single unit
>anyway. It's not like a browser is going to display the book of
>Leviticus and have to worry about a marked region starting in Exodus
>and ending in Deuteronomy.

On the contrary, that is *exactly* the problem. I do have Leviticus on a
web site, and although my server is kind enough to break it into net-size
chunks if/when asked, I sure do have to know whether there is some
long-distance thing in effect, otherwise we can't know to send whatever
start-tag caused it when sending a smaller piece.
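
To make the bookkeeping concrete, here is a rough sketch of what a server has to track before it can send a smaller piece. It assumes fully tagged, XML-like markup with no omitted end-tags (real SGML is harder), and the name open_elements_at is made up for illustration:

```python
import re

# Matches a start-tag, end-tag, or empty-element tag (simplified; assumes
# no "<" or ">" inside attribute values and no omitted tags).
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def open_elements_at(markup, offset):
    """Return the element types still open at character `offset`."""
    stack = []
    for m in TAG.finditer(markup):
        if m.end() > offset:
            break                      # this tag lies past the chunk start
        closing, name, empty = m.groups()
        if closing:
            stack.pop()                # end-tag closes the innermost element
        elif not empty:
            stack.append(name)         # start-tag opens a new element
    return stack

markup = "<book><chap><hi>Leviticus text</hi> more</chap></book>"
inside_hi = open_elements_at(markup, markup.index("text"))
after_hi = open_elements_at(markup, markup.index("more"))
```

A chunk starting at "text" has to be sent with the start-tags for book, chap, and hi re-supplied; one starting at "more" needs only book and chap. Note that this scan is linear in everything that precedes the chunk.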

>> Likewise, one cannot easily build a stack-based
>> formatter, e.g. that keys styles off the list of element types in one's
>> ancestry.
>
>This is only partly true, and irrelevant besides.
>If the browser is going to include this functionality --
>highlighting regions that may cross element boundaries --
>it can't use ancestor-driven style resolution in any case,
>regardless of how the regions are identified.

Your critique is incorrect. Existence proof: open a DynaText book, since
DynaText does in fact use "ancestor-driven style resolution" for SGML.
It quite happily supports "highlighting regions that may cross element
boundaries" -- just do a drag-select or a phrase search and watch. One
reason the point Dave cites is relevant is that highlighting can reasonably
be construed as a different animal from style resolution. In actual
practice, this has many advantages.

>As far as efficiency goes, the Tk text widget is quite efficient,
>and it doesn't use any hierarchical information at all;
>all formatting attributes are specified with discontiguous,
>potentially overlapping tagged regions.
>
>And lastly, you *can* use a single-pass parser with a stack-based
>formatter to keep track of marked spans.

Precisely my point: you must do O(n), not O(lg n). Is that not unfortunate?
If you only want to solve tiny cases, of course it doesn't matter how you
do it. But if you want a system that will last, you have to think more
about scalability. If Tk isn't using any hierarchical information at all,
then its format control is a lot more limited than it need be.

>> An editor is in even worse shape. There is no way to validate
>> that such pairs even match, because "matching" is not a generic notion --
>> it has to be custom-built for each kind of pair.
>
>Any SGML parser can do the ID/IDREF validation, and HyTime reftype
>constraints can do (most of) the rest, if it's that important.

Sorry, but ID/IDREF can't do this. SGML cannot validate that your empty
elements come in pairs. You can make one end have an ID and one an IDREF,
but SGML cannot guarantee there is any particular *number* of IDREFs that
point to an ID (in this case, that number would have to be exactly 1). Nor
can SGML guarantee that the start of a span precedes the end. HyTime
reftypes don't help for this case -- all they do is let you guarantee the
element type of the IDREF's target (and actually any element whose content
satisfies the content model for the specified GI is accepted, even if the
GI is different).

This is really important, since without such checking it is very hard for
authors to balance their markers. This is especially true when they start
cutting and pasting, since it is easy to pick up a scope that contains one
such end, and move it unknowingly with bizarre effects.
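
For illustration, here is roughly the kind of custom check an application would have to bolt on, since the parser won't do it. This is only a sketch: the element names mstart and mend and the ref attribute are hypothetical, and it assumes well-formed XML-like input rather than full SGML:

```python
import xml.etree.ElementTree as ET

def check_markers(root, start_tag="mstart", end_tag="mend"):
    """Report start/end marker pairs that do not match one-to-one, in order."""
    starts = {}   # marker id -> document-order position of the start marker
    ends = {}     # marker id -> positions of end markers that refer to it
    for pos, el in enumerate(root.iter()):
        if el.tag == start_tag:
            starts[el.get("id")] = pos
        elif el.tag == end_tag:
            ends.setdefault(el.get("ref"), []).append(pos)

    problems = []
    for ident, pos in starts.items():
        hits = ends.get(ident, [])
        if len(hits) != 1:
            problems.append(f"{ident}: expected 1 end marker, found {len(hits)}")
        elif hits[0] < pos:
            problems.append(f"{ident}: end marker precedes start marker")
    for ident in ends:
        if ident not in starts:
            problems.append(f"{ident}: end marker without a start marker")
    return problems

good = ET.fromstring('<p><mstart id="a"/>x<b>y<mend ref="a"/></b></p>')
bad = ET.fromstring('<p><mend ref="a"/><mstart id="a"/><mstart id="b"/></p>')
good_report = check_markers(good)
bad_report = check_markers(bad)
```

Every new kind of marker pair needs its own version of this, which is the point: "matching" has to be custom-built each time.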

Out-of-line markers are pretty easy. I'll use Xanadu tumbler notation for
the treewalk instead of HyTime, for brevity, but you get the same effect
with treeloc followed by a leaf-level dataloc in HyTime, or via the
corresponding TEI structure:

<marked from="1.5.2.4" to="1.5.3.1">

Could hardly be simpler. Permit the first component of the address to be an
ID, and you've got a pretty robust system.
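
As a rough sketch of how little machinery such an address needs, the following resolves a dotted path against a parsed tree. The conventions are assumptions made for the example, not part of any standard: each numeric component picks the n-th child element (counting from 1) starting at the root, text nodes are not counted as children, and a non-numeric first component is looked up as an ID:

```python
import xml.etree.ElementTree as ET

def resolve(root, address):
    """Walk a dotted address such as "1.2" or "c2.1" down from `root`."""
    parts = address.split(".")
    node = root
    if not parts[0].isdigit():
        # Assumed convention: a non-numeric first component names an ID.
        wanted = parts.pop(0)
        node = next((e for e in root.iter() if e.get("id") == wanted), None)
        if node is None:
            raise LookupError(f"no element with id {wanted!r}")
    for part in parts:
        children = list(node)
        index = int(part)
        if not 1 <= index <= len(children):
            raise LookupError(f"no child {index} at step {part!r}")
        node = children[index - 1]     # addresses count from 1
    return node

doc = ET.fromstring(
    '<book><chap id="c1"><p>one</p><p>two</p></chap>'
    '<chap id="c2"><p>three</p></chap></book>'
)
by_path = resolve(doc, "1.2").text     # second <p> of the first <chap>
by_id = resolve(doc, "c2.1").text      # first <p> under the element with id c2
```

Starting from an ID means an edit elsewhere in the document does not invalidate the address; only changes below the ID'd element can.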

The advantages of out-of-line specs include:

* For cases like search-hit highlighting, you don't have to change
the document content itself -- the meta-information stays separate.
For example, a client could choose to discard them during 'save'.

* You can point even into things that can't contain IDs, such as
CDATA elements.

* You can use the same mechanism to point into, out of, and between
not just sgml or html docs, but graphics and other media.

* You can even annotate or otherwise link read-only data.

A note received after I started this reply asked if anyone knew what the
HyTime syntax is for tree-path locators, byte offsets from tree locations,
etc. It's easy, though one must be *very* careful about defining byte
offsets whether in HyTime or anything else -- it is not at all easy to
count across element bounds, because it introduces an interaction between
the parser (which actually knows offsets of things in the source), and the
higher-level application (which I hope only knows about the structures the
parser found!).

But at any rate, one method of doing this in HyTime is to chain a nameloc
that points to some element with an ID, to a treeloc that walks down a level
at a time by child number, to a dataloc that expresses the byte offset into
the leaf of choice. For example:

<nameloc id=n1 nametype=element>
<nmlist> sec37 </></>
<treeloc id=t1 locsrc=n1>
1 6 3</>
<dataloc id=d1 locsrc=t1>
10 5</>

This 'location ladder' would be referenced via ID "d1", and it points to
5 characters beginning at the 10th character of the 3rd child of the 6th
child of the subtree with ID sec37. The initial "1" in the treeloc must be
there in case ID sec37 points (possibly indirectly) to a forest of nodes,
not a tree: in that case you'd use it to specify which tree of the forest.
For typical cases just put in "1" and don't worry.
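
To show how the three rungs compose, here is a small sketch that resolves the same ladder by hand. It is not a HyTime engine: the function name ladder is made up, the ID is assumed to name a single tree (so the leading "1" is just checked and skipped), only element children are counted, and the dataloc is read as a 1-based character start plus a length, as described above:

```python
import xml.etree.ElementTree as ET

def ladder(root, name, treeloc, dataloc):
    """Resolve nameloc -> treeloc -> dataloc and return the addressed text."""
    # nameloc: find the element carrying the ID.
    node = next((e for e in root.iter() if e.get("id") == name), None)
    if node is None:
        raise LookupError(f"no element with id {name!r}")
    # treeloc: the first number selects a tree of the forest; here always 1.
    first, *steps = treeloc
    if first != 1:
        raise ValueError("this sketch handles a single tree only")
    for n in steps:
        node = list(node)[n - 1]       # n-th child, counting from 1
    # dataloc: `length` characters beginning at the `start`-th character.
    start, length = dataloc
    text = "".join(node.itertext())
    return text[start - 1 : start - 1 + length]

doc = ET.fromstring(
    '<doc><sec id="sec37">'
    '<p/><p/><p/><p/><p/>'
    '<list><item>a</item><item>b</item><item>abcdefghiHELLOxyz</item></list>'
    '</sec></doc>'
)
span = ladder(doc, "sec37", [1, 6, 3], (10, 5))
```

Here span comes out as the five characters HELLO: the 6th child of sec37 is the list, its 3rd child is the last item, and the count starts at that item's 10th character.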

The Text Encoding Initiative Guidelines (available online at various sites)
give another syntax worth considering, described with full BNF and semantic
definitions in section 14.2. The equivalent ladder would be:

<xptr target="ID (sec37) CHILD (6) (3) STR (10 14)">

Both of these syntaxes are highly powerful and flexible, and are already
formally standardized, proven, and available for adoption. Let's just use
one.

<shameless-plug>
Both syntaxes are discussed in much more detail in Steven J. DeRose and
David G. Durand: *Making HyperMedia Work: A User's Guide to HyTime*.
Boston: Kluwer Academic Publishers, 1994. ISBN 0-7923-9432-1.
</shameless-plug>

Steve