Re: Wanted: HyperMail Done Right

Gerald Oskoboiny (gerald@cs.ualberta.ca)
Tue, 26 Dec 1995 04:26:59 -0700 (MST)


Daniel W. Connolly writes:

> Hypermail is great. Mhonarc is even better. But I've got a lot of
> ideas for improvements:

I wrote something called "HURL: the Hypertext Usenet Reader & Linker."
(A better name would be "a hypertext interface to news archives").
More info is at: <URL:http://ugweb.cs.ualberta.ca/~gerald/hurl/>.

Before I get anyone's hopes up, I should point out:

- you can't actually play with it now, because all existing builds
are on sunsite.unc.edu, which recently suffered a major disk crash;

- I *still* don't have a distribution package ready, although I've
been promising one for ages. Hopefully within a couple of weeks.

If you want to see what the interface looks like, there are screen shots
available at the URL above. Hopefully Sunsite will be back to normal RSN.

> Requirements:
>
> 0. Support MIME ala mhonarc.

HURL was originally designed for Usenet archives, and since MIME isn't
widely used on Usenet (yet), this hasn't been a high priority. Right now
it treats everything as text/plain and puts it in <PRE> blocks. I don't
know what mhonarc does; I could probably make HURL handle text/html easily
enough, but there's other stuff I'd like to work on first, I think.

> 1. Base the published URLs on the global message-ids, not on local
> sequence numbers. So instead of:
>
> http://www.foo.com/archive/mlist/00345.html
>
> I want to see:
>
> http://www.foo.com/archive/mlist?message-id=234234223@bar.net
>
> This allows folks to write down the URL of the message as soon
> as they post it -- they don't have to wait for it to show
> up in the archive.

Yup. HURL's URLs are something like:

http://www.foo.com/www-talk/msgid?foo@bar.net

Broken links are a big peeve of mine, so I've tried to make sure that
any URLs created by HURL will work forever.

FYI, Kevin Hughes at EIT has written a script that redirects message-ID-
based queries to the appropriate URL within his Hypermail archives of the
www-* lists; for more information see
<URL:http://www.eit.com/www.lists/refer.html>.

> Hhmmm... I wonder if deployed web clients handle relative query
> urls correctly, e.g.:
>
> References: <a href="?message-id=0923408.xxx.yyy">09823408.xxx.yyy</a>

With HURL this is just <a href="msgid?0923408.xxx.yyy">0923408.xxx.yyy</a>.

Message-ID references only get linked if the article actually exists in
the archive. So if there are 5 articles in the References: line, and only
4 of them happen to be in the archive (possibly due to a thread that was
dragged in from another group), the 5th one doesn't get a link. This was
expensive, but worthwhile IMO because it prevents error messages like
"sorry, that article isn't in the archive." Also, msgid refs get linked
within the body of articles, not just in the References line.
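
To make that concrete, the logic is roughly this (a Python sketch with
made-up names; HURL itself doesn't actually look like this):

    import re

    MSGID_RE = re.compile(r'<([^<>\s]+@[^<>\s]+)>')

    def link_references(text, archive_ids):
        """Wrap each <msgid> in an HREF, but only if that article is archived."""
        def repl(match):
            msgid = match.group(1)
            if msgid in archive_ids:
                return '<a href="msgid?%s">&lt;%s&gt;</a>' % (msgid, msgid)
            return '&lt;%s&gt;' % msgid    # not in the archive: no link
        return MSGID_RE.sub(repl, text)

    # Only the first reference exists in the archive, so only it gets linked.
    print(link_references('References: <foo@bar.net> <gone@elsewhere.org>',
                          {'foo@bar.net'}))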

> 2. Support format negotiation. Make the original message/rfc822 data
> available as well as the enhanced-with-links html format -- at the
> same address. This _should_ allow clients to treat the message as a
> message, i.e. reply to it, etc. by specifying:
>
> Accept: message/rfc822

Hmm. No format negotiation, but there's a "see original article" link
that shows the current article with full headers and without all the extra
hypertext junk. Also, the original RFC822-style article can be retrieved
using a URL of: http://site.org/path/original?foo@bar.com
instead of: http://site.org/path/msgid?foo@bar.com .
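
(The idea is just two views of the same stored message. A tiny sketch of
the dispatch, where "load_article" and "to_html" are made-up stand-ins for
whatever actually does the work:)

    # Sketch only: serve the raw RFC822 text at .../original and the
    # hypertext view at .../msgid, from the same stored article.
    def respond(script_name, msgid, load_article, to_html):
        raw = load_article(msgid)                 # full original article text
        if script_name.endswith('/original'):
            return 'Content-Type: text/plain\r\n\r\n' + raw
        return 'Content-Type: text/html\r\n\r\n' + to_html(raw)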

> 3. Keep the index pages to a reasonable size. Don't list 40000
> messages by default. The cover page should show the last 50 or so
> messages, plus a query form where folks can select articles...

Yup. A query result puts you in something called the Message List Browser,
which shows (by default) 100 messages per "page", with "Next" and "Previous"
links to other pages, etc.
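
(Under the hood it's nothing fancy: the query result is an ordered list of
article IDs, and each "page" is just a slice of it. A sketch, with made-up
names:)

    # Sketch: one "page" of a stored query result.
    def page_of(result_ids, page, per_page=100):
        start = page * per_page
        ids = result_ids[start:start + per_page]
        has_prev = page > 0
        has_next = start + per_page < len(result_ids)
        return ids, has_prev, has_next

    ids, has_prev, has_next = page_of(list(range(345)), page=2)
    print(len(ids), has_prev, has_next)   # 100 True True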

> 4. Allow relational queries: by date, author, subject, message-id,
> keywords, or any combination. Essentially, treat the archive as a
> relational database table with fields message-id, from, date, subject,
> keywords, and body.

There's a query page that lets you query any article header, with multiple
headers combined using AND logic. OR logic can be specified within an
individual header's pattern using a comma, so you can do this:

Subject: center,centre
AND Date: 94
AND From: netscape

to "find articles with Subject containing center or centre posted in 1994
by someone whose From line matches netscape".
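
(In other words, headers are ANDed together and the comma-separated
alternatives within a header are ORed. A rough sketch of the matching only,
with made-up names; the real thing is messier:)

    # Every queried header must match (AND); a header matches if any
    # comma-separated alternative appears in it (OR). Case-insensitive
    # substring matching, as in the example above.
    def matches(headers, query):
        for field, pattern in query.items():
            value = headers.get(field, '').lower()
            alternatives = [p.strip().lower() for p in pattern.split(',')]
            if not any(alt in value for alt in alternatives):
                return False
        return True

    article = {'Subject': 'How do I centre text?',
               'Date': 'Mon, 7 Mar 94 10:00:00 GMT',
               'From': 'someone@netscape.com'}
    print(matches(article, {'Subject': 'center,centre',
                            'Date': '94',
                            'From': 'netscape'}))   # True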

Unfortunately, back when I started this project I couldn't find a good
freely-distributable database package, so the current query system is an
ugly hack (it works, but it's inefficient). However, a friend of mine has
been working on an Isite-based replacement, apparently with good results.
I hope to replace my hack with his stuff sometime in the future.

> Goals:
>
> 5. Generate HTML on the fly, not in batch.

Yup.

> Cache the most recent pages of course (in memory?),
> but don't waste all that disk space.

I don't think this would be a win for HURL, because:

- pages have state info encoded in them (such as a cookie identifying
the current query result), so each returned article is unique;

- the article-displaying script isn't "slow" (relatively, anyway), and
the HTML version is never stored on the server in any case.

Better for HURL would be to cache query results, which is on my list of
things to do.

> Update the index in real-time, as messages arrive, not in batch.

I have nightly builds of several archives (and builds are rotated into
place so there's no downtime), but there's no incremental indexing (yet).
I had initially envisioned the builds taking place infrequently so this
wasn't a high priority, but it's one of the things I want to implement next.

> 6. Allow batch query results. Offer to return the raw message/rfc822
> data (optionally compressed) for, e.g. "all messages from july 7 to
> dec 1 with fred in the from field".

I plan to add the ability to download a .tar.gz or .zip file of the
messages comprising the current query result in their original RFC822
format.
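
(Something along these lines, presumably; a sketch using Python's tarfile
module, where "fetch_original" is a made-up stand-in for whatever returns
the original article text for a message-ID:)

    import io, tarfile

    def bundle(result_msgids, fetch_original, out_path='result.tar.gz'):
        # Pack the raw RFC822 text of each article in the query result.
        with tarfile.open(out_path, 'w:gz') as tar:
            for i, msgid in enumerate(result_msgids):
                data = fetch_original(msgid).encode('utf-8')
                info = tarfile.TarInfo(name='%05d.txt' % i)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        return out_path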

> 7. Export a harvest gatherer interface, so that collections of mail
> archives can be combined into harvest broker search services where
> folks can do similar relational and full-text queries.

I've had some good preliminary results with full-text searches against
individual archives using Glimpse, but nothing like Harvest yet...

> 8. Allow annotations (using PICS ratings???) for "yeah, that
> was a really good post!" or "hey: if you liked that, you
> should take a look at ..."

The original motivation for this project was to do exactly that: take
150,000 articles from talk.bizarre and sort them into the Good, the Bad,
and the Ugly. (For talk.bizarre, the Good are few and far between,
but when they're good, they're really good).

ObAttribution: Mark-Jason Dominus was the one with the original idea to
do this article-scoring stuff (in fall of 93, I think), and he was the
one with the incredible foresight to archive everything posted to t.b.
for the last five years (and counting).

I'm not sure how this voting stuff should proceed exactly; it could be
as simple as vote-on-a-scale-from-one-to-ten, or something much, much
more complex (and powerful).

> 9. Make it a long-running process exporting an ILU interface, rather
> than a fork-per-invocation CGI script. Provide a CGI-to-ILU hack for
> interoperability with pre-ILU web servers.

Uh. I'll just pretend I didn't see this.

> Major brownie points to anybody who builds something that supports at
> least 1 thru 4 and makes it available to the rest of us.

Mmm, brownies. I still fail the "make it available to the rest of us"
condition, though. Even if I wanted to make a distribution package today,
I couldn't, due to the Sunsite crash which has left me (temporarily)
without access to my most recent code.

> I'd really like to use it for all the mailing lists around here.

I've been meaning to do a build of www-* and html-wg for a while now,
and Sunsite recently got another 20 gigs of disk, so as soon as things
settle down over there I might take care of this...

> Ah! I just remembered one more:
>
> >Goals:
>
> 10. Support a WYSIWYG-ASCII format, a la SeText or WikiWikiWeb[1]
> so that folks can send reasonable looking plain text email,
> but it can be converted to rich HTML in an automated fashion.

I'm not quite clear on this, but I try to add links within articles
whenever appropriate. This sort of interface definitely has lots of neat
possibilities; some of the things I've done so far include:

- auto-recognition and linking of URLs within articles (of course);

- if you're reading an article and see a word you don't understand,
you can click on "Filters..." and then "Webster" to reload the
current article with each (non-trivial) word linked to the Webster
gateway at cmu.edu (other filters include rot13 decoding, etc.);

and things I'd like to do include:

- auto-recognizing footnote references like your [1] above (i.e.,
putting an HREF to a fragment ID of "foot1" and putting a matching
NAME around the appropriate footnote at the bottom of articles).
I'd have to change my article-displaying loop to be a two-pass
process if I wanted these links to be "reliable", though (there's a
rough sketch of the idea after this list).

- auto-recognizing any occurrences of "ISBN x-xxxx-xxxxx-x" and
linking them to a complete Library of Congress entry on that
publication.

- adding a nice way to specify newsgroup-specific filters; for instance,
for an archive of comp.lang.perl.misc you might want to link any perl
function references within each article to a hypertext man page
somewhere, or link perl5 package references to more information on the
package; for an archive of rec.food.recipes you might want to add links
from each ingredient listed to an imperial-to-metric converter...
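
On the footnote idea above: the two-pass part just means finding out which
footnotes actually exist at the bottom of the article before linking any
references to them. Roughly (a sketch, not HURL code):

    import re

    REF_RE = re.compile(r'\[(\d+)\]')

    def link_footnotes(body):
        # Pass 1: a footnote is "defined" if some line starts with [n].
        defined = {m.group(1) for line in body.splitlines()
                   for m in [REF_RE.match(line.strip())] if m}

        # Pass 2: link only in-text references that have a definition, and
        # put a NAME anchor on the definition line itself.
        def repl(m):
            n = m.group(1)
            if n not in defined:
                return m.group(0)            # no matching footnote: leave as-is
            return '<a href="#foot%s">[%s]</a>' % (n, n)

        out = []
        for line in body.splitlines():
            m = REF_RE.match(line.strip())
            if m and m.group(1) in defined:
                out.append('<a name="foot%s"></a>' % m.group(1) + line)
            else:
                out.append(REF_RE.sub(repl, line))
        return '\n'.join(out)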

> and Daniel reminded me:
>
> 12. Interoperate with USENET somehow. (perhaps the archive
> parallels a moderated newsgroup?)

Interoperating with Usenet wasn't part of the original plan, since I
figured that most of the time people would be reading really really
old stuff: does it make sense to followup to a 4-year-old article?
How about a 10-year-old article? (We have stuff from net.bizarre, too.)
It's a possibility, though.

Gerald
247 lines!? ugh, sorry...

-- 
Gerald Oskoboiny  <gerald@cs.ualberta.ca>  http://ugweb.cs.ualberta.ca/~gerald/