Indexing of WWW space (going one higher than HTML)?

Terry Winograd (Winograd@cs.stanford.edu)
Tue, 2 Aug 1994 10:12:37 -0800


This was in www-talk so most of you have probably seen it. Seems relevant
to various aspects of our work. --t

From: Paul Wain <Paul.Wain@brunel.ac.uk>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Indexing of WWW space (going one higher than HTML)?

Hi all,

I know this has been covered by this group on many occasions (and I have read
up on what was said :), but we are hitting the age-old problem of how to index
a web space.

The thing is that we are looking at possibly taking things one step further
than that though. Let me try and explain....

Finally at long last Brunel is looking at taking its Campus Wide Information
Service (how I hate that term since we now have 3 campuses!) from being a
research area to a fully fledged production type service. This involves the
creation of a large amount of HTML very quickly. For this work we basically
decided that the only way that we could effectively do this was raw HTML since
we are talking about probably 50 pages of HTML in around 25 working days to
effectively produce the "core" of the service and some example departmental
pages. This is a very short term solution.

All is going well so far (including the Biology department being one of the
first to actively join in). However, we are soon going to have to start
introducing and maintaining a medium term approach.

What we would like to see is a way to tailor the way that a user sees the web
space here (also variable on whether the user is internal or external!) at the
same time as trying to introduce control over document revision and authorship.

Having looked around this brings a few questions to mind:

1) Is raw HTML the best approach for documents? I know that it is definitely the
best approach for transmitting them in this case, but would some on-the-fly
conversion (being careful to send Expires and Last-Modified headers) be better?
People have already started raising doubts here as to how easy it is to
maintain an HTML file, so what I am really asking, I guess, is: does anyone have
any experience in this, and if so, what have they found/done about it?
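
As an illustration of the on-the-fly conversion raised in question 1, here is
a minimal sketch in Python. It is not anything Brunel runs: the source format
(first line is the title, blank lines separate paragraphs), the function name
render_page, and the one-hour default lifetime are all assumptions made for
the example. It shows Last-Modified being taken from the source file and
Expires being set a fixed interval after the time of the request.

```python
import os
from email.utils import formatdate
from html import escape


def render_page(source_path, now, max_age_seconds=3600):
    """Convert a plain-text source file to HTML and build caching headers.

    Assumed source format (hypothetical): the first line is the title and
    blank lines separate paragraphs. `now` is the request time in seconds
    since the epoch.
    """
    with open(source_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()

    title = lines[0].strip() if lines else "Untitled"
    body_text = "\n".join(lines[1:])
    paragraphs = [p.strip() for p in body_text.split("\n\n") if p.strip()]

    html_lines = [
        "<HTML><HEAD><TITLE>{}</TITLE></HEAD>".format(escape(title)),
        "<BODY>",
        "<H1>{}</H1>".format(escape(title)),
    ]
    html_lines.extend("<P>{}</P>".format(escape(p)) for p in paragraphs)
    html_lines.append("</BODY></HTML>")

    headers = {
        "Content-Type": "text/html",
        # Last-Modified reflects the source file, not the conversion time.
        "Last-Modified": formatdate(os.path.getmtime(source_path), usegmt=True),
        # Expires tells caches how long the generated page may be reused.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
    return headers, "\n".join(html_lines) + "\n"
```

A server script would call render_page(path, time.time()) and write the
headers before the body. Taking Last-Modified from the source file means a
cache only sees a change when the underlying document has actually been
edited, even though the HTML itself is regenerated on every request.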

2) There are 5 key bits of information about the document that we have
identified, that are a requirement for some form of centralised searching or
indexing:

a)* Document title
b) Keywords associated with the document
c) Short Description of the Document
d)* Document author
e) Document owner

The ones marked with a * should be displayed when the document is viewed; the
ones that aren't should at least be in the document somehow. Is this a valid use
of the META tags? Or are there other methods we could use? If we went outside
HTML and accepted that we could convert back to HTML on the fly (see question
1), what other markup language could we use? (I say markup language because I
think that, say, correctly annotated word processor files are NOT the way to go
here.)
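
To make the META idea in question 2 concrete, the following Python sketch
writes the five fields into a document head and reads them back, as a central
indexer might. The NAME values used here ("keywords", "description", "author",
"owner") and the helper names build_head and extract_fields are assumptions
chosen for the example, not an agreed convention; "owner" in particular is a
made-up name.

```python
from html import escape
from html.parser import HTMLParser


def build_head(title, keywords, description, author, owner):
    """Return a <HEAD> block carrying the five fields listed above."""
    meta = {
        "keywords": ", ".join(keywords),
        "description": description,
        "author": author,
        "owner": owner,  # hypothetical NAME value
    }
    lines = ["<HEAD>", "<TITLE>{}</TITLE>".format(escape(title))]
    for name, content in meta.items():
        lines.append('<META NAME="{}" CONTENT="{}">'.format(name, escape(content)))
    lines.append("</HEAD>")
    return "\n".join(lines)


class MetaExtractor(HTMLParser):
    """Collect the TITLE text and every META NAME/CONTENT pair."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.fields[name.lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data


def extract_fields(html_text):
    """Return a dict of the indexable fields found in html_text."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fields
```

The title and author would still need to appear in the visible body of the
page to satisfy the starred items; the META entries only cover the "at least
be in the document somehow" part.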

3) In question 2, I identified the document owner and author as being 2 quite
separate people. As well as this, we foresee a position where both may be
dynamic over time. Actually I see the following situation:

Owner's name/contact is dynamic
Author's contact is dynamic
Author's name is static

So what schemes have people come up with for this? My initial feeling is that
it could be represented as something like:

<P>Contact details for the <A HREF="./1234.html">Author</A> and <A
HREF="./4321.html">Owner</A> of this page</P>

This, although clumsy, would allow details to change without needing to
revise 70 or so documents. Again, would using a higher level of markup and
producing HTML pages on the fly help to avoid this problem?
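
One way the on-the-fly route could handle question 3 is to keep the contact
details in a single registry and have each document store only identifiers,
with the footer generated at request time. The Python sketch below assumes a
hypothetical record layout: the document keeps the author's name (static) plus
an author_id and owner_id, and the registry maps those IDs to the current
contact details. The IDs reuse the 1234 and 4321 from the links above; the
names and addresses in the usage example are placeholders.

```python
from html import escape


def render_contact_footer(document, contacts):
    """Build the contact footer from the current state of the registry.

    document: dict with "author_name", "author_id" and "owner_id"
              (hypothetical layout).
    contacts: dict mapping an ID to {"name": ..., "email": ...}.
    """
    author_contact = contacts[document["author_id"]]
    owner = contacts[document["owner_id"]]
    return (
        '<P>Author: {author} (<A HREF="mailto:{a_mail}">{a_mail}</A>). '
        'Owner: {owner} (<A HREF="mailto:{o_mail}">{o_mail}</A>).</P>'
    ).format(
        # The author's name is static, so it comes from the document itself.
        author=escape(document["author_name"]),
        a_mail=escape(author_contact["email"]),
        # The owner's name and contact are both dynamic, so both are looked up.
        owner=escape(owner["name"]),
        o_mail=escape(owner["email"]),
    )


# Usage with placeholder data: one registry entry serves every document.
contacts = {
    "1234": {"name": "A. Author", "email": "author@example.org"},
    "4321": {"name": "Dept Office", "email": "office@example.org"},
}
page = {"author_name": "A. Author", "author_id": "1234", "owner_id": "4321"}
footer = render_contact_footer(page, contacts)
```

Changing the entry for 4321 in the registry changes the owner shown on every
page that refers to it, with no edits to the documents themselves.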

4) Finally, on searching the web space, have there been any advances in this
recently? People here seem to be against the ALIWEB style approach to indexing.
What other methods are people looking at?

I think that covers a lot of the holding points at the moment (except that we
still can't get sgmls to work correctly - but we are working on it :) Any
comments on this topic are appreciated,

Cheers,

Paul

.--------Paul Wain ( X.500 Project Engineer and WWW Person at Brunel)---------.
| Brunel WWW Support: www@brunel.ac.uk MPhil Email: Paul.Wain@brunel.ac.uk |
| Work Email (default): Paul.Wain@brunel.ac.uk (Brunel internal extn: 2391) |
| http://http2.brunel.ac.uk:8080/paul or http://http2.brunel.ac.uk/~eepgpsw |
`-------------------So much to fit in, and so little space!-------------------'