Re: Structure vs. appearance in HTML

Philippe-Andre Prindeville (philipp@res.enst.fr)
Sat, 23 Sep 95 06:52:17 +0200


On Sep 22, 8:35, Jon Wallis wrote:
> >This last one is dubious. I have no way of saying, find me all
> >occurrences of "Sprint" (as a proper noun, i.e. name) in a document
> >or set of documents, skipping "sprint" the verb or noun. Obviously,
> >"... winning the men's 100m sprint." does not pertain to
> >telecommunications or American corporate culture.
>
> If the author has used a capital S for "Sprint" the company you'd then have
> a chance of parsing it properly with a search engine.

Uh, no. Not quite.

Making searches case-dependent is dubious. Would you know whether
the Dutch name is spelt "van Buren" or "Van Buren"? Maybe not.

And sometimes press releases in machine-readable form have their
lead-in sentence entirely in capitals:

BEN JOHNSON DISQUALIFIED FROM 100M SPRINT AFTER FAILING DRUG TESTING.

So, what would the proper case for "sprint" be here?

Relying on case is just plain naive. What about all the e.e. cummings
poems written entirely in lowercase? Should these be excluded from
searches? I think not.
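
To make the point concrete, here is a minimal sketch in Python. The
sample lines are made up for illustration (only the Ben Johnson headline
comes from this message), and the two helper functions are hypothetical
names, not part of any real search tool.

```python
import re

# Hypothetical sample lines; only the Ben Johnson headline is from the message.
docs = [
    "Sprint announced new long-distance rates.",
    "He went on to win the men's 100m sprint.",
    "BEN JOHNSON DISQUALIFIED FROM 100M SPRINT AFTER FAILING DRUG TESTING.",
    "SPRINT CUTS LONG-DISTANCE RATES.",
]

def case_sensitive_hits(word, lines):
    """Lines containing `word` as a whole word, matching case exactly."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [line for line in lines if pattern.search(line)]

def case_insensitive_hits(word, lines):
    """Lines containing `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]

# Exact case finds the first line but misses the all-caps company headline.
print(case_sensitive_hits("Sprint", docs))

# Ignoring case finds every line, including both lines about the race.
print(case_insensitive_hits("Sprint", docs))
```

Neither variant separates the company from the race: the case-sensitive
search loses the capitalised headline about the company, and the
case-insensitive one returns the athletics lines as well.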

> Also, if the search tool does proximity indexing you could index the page
> according to what other terms occurred near to "sprint".

Someone here at the university is doing his PhD on this subject.
While I have great respect for him, he has been in his "last year"
on this thesis for about 36 months now... I don't think I'll
ever get to buy him a round of drinks.

A good idea, but... no. Its time has not yet come.
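
For readers unfamiliar with the idea, the following is a toy sketch of
proximity-based guessing, not a description of any real indexer. The
function names, the window size, and both cue-word lists are assumptions
picked by hand for this example; a real system would have to learn or
curate them, which is where the difficulty lies.

```python
import re

def neighbours(text, target, window=4):
    """Return the lowercased words within `window` tokens of `target`."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    near = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            near.update(tokens[start:i])               # words before the hit
            near.update(tokens[i + 1:i + 1 + window])  # words after the hit
    return near

# Hypothetical cue words, chosen by hand for this illustration only.
TELECOM_CUES = {"rates", "network", "long", "distance", "carrier"}
ATHLETICS_CUES = {"100m", "men's", "race", "winning", "disqualified"}

def guess_sense(text):
    """Guess which sense of "sprint" a text uses from its nearby words."""
    near = neighbours(text, "sprint")
    telecom = len(near & TELECOM_CUES)
    athletics = len(near & ATHLETICS_CUES)
    if telecom == athletics:
        return "unknown"
    return "telecom" if telecom > athletics else "athletics"

print(guess_sense("... winning the men's 100m sprint."))
print(guess_sense("Sprint cut its long distance rates"))
print(guess_sense("Sprint said no."))
```

The last call shows the weakness: with no cue words nearby, the guess
falls back to "unknown", and the author's intent is simply not recoverable
from the text.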

> Better still, if the page were *classified* (e.g., using Dewey), you'd know
> whether the page was about athletics (796) or telecomms (384). Page
> classification makes searching much more powerful (especially if combined
> with text indexing). Then if you wanted to find all occurrences of "Sprint"
> (as a proper noun, i.e. name) in a document you could choose only to search
> documents that were about telecomms or corporate culture or whatever.

Fine. But would you search for articles by Richard Feynman in
(a) mathematics (b) physics (c) biology or (d) chemistry? I would
look in all four domains.

You can't put blinders on in science. All too often interesting
developments take place in parallel, and the person who sees things
broadly is the one that changes the paradigms with which we think.

> The only alternative I can see would be to tag *every* word with its
> grammatical value (noun, verb, whatever), on the basis that it *might* be
> useful to someone, somewhere, sometime in the future. Not a pleasant prospect.

Go back and read your posting and tell me how many proper nouns
were in the message... and then explain to me why tagging them
would have been painful.
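
As a rough illustration of what such tagging would buy a search tool,
here is a short Python sketch. The <name> element is a hypothetical tag
invented for this example, not an element of any HTML specification, and
the helper functions are likewise assumptions.

```python
import re

# <name> is a hypothetical tag used only for this illustration.
NAME_TAG = re.compile(r"<name>(.*?)</name>", re.IGNORECASE | re.DOTALL)

def tagged_names(markup):
    """Return the text of every span the author marked as a proper name."""
    return [m.strip() for m in NAME_TAG.findall(markup)]

def mentions_name(markup, name):
    """True if `name` appears as a tagged proper name, whatever its case."""
    wanted = name.casefold()
    return any(n.casefold() == wanted for n in tagged_names(markup))

headline = "<name>BEN JOHNSON</name> DISQUALIFIED FROM 100M SPRINT"
release = "<name>SPRINT</name> CUTS LONG-DISTANCE RATES"

print(mentions_name(headline, "Sprint"))  # the race is not tagged as a name
print(mentions_name(release, "Sprint"))   # the company is
print(mentions_name("President <name>Van Buren</name>", "van Buren"))
```

Because the match is made on the tag rather than on capitalisation, the
all-caps headline and the "van Buren" spelling question stop being
problems for the search.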

> >We are creating and stocking quantities of information that will
> >be used well into the next century. Machines will be used to
> >search these enormous quantities of data. If it isn't tagged
> >meaningfully now (at its inception), it never will be. And that
> >will be a real shame.
>
> I entirely agree. But I suspect it's a battle that's already being lost.

What if we had surrendered to the Japanese before Midway?

> >> * Render it on just about any output device, with reasonably
> >> good results.
> >
> >Whoopie.
>
> I can't tell if this is sarcastic or not. Platform-independent display is a
> real bonus.

It's pretty. Nothing more. Well, not much more. (yeah, I'll
probably get flamed for this, but who cares...)

-Philip