Sure. I've been considering several storage mechanisms:
    1. MH folders
    2. Berkeley mail files
    3. a Berkeley mail file, plus an "index" file of just
       the headers (stripping out Received and such),
       and a DBM index that maps message-ids to offsets
       in the raw file and the index file
    4. mSQL
    5. Sybase, Oracle, ...
For searching, grep on #3 should be mighty fast. Appending is fast.
Rebuilding the database is fast.
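To make option 3 concrete, here is a rough sketch, not a finished design. The function names, the file layout, and the JSON-encoded offset record are all my assumptions for illustration; it stores messages back to back and leaves the "From " separator convention to the caller.

```python
import dbm
import email
import json
import os

SKIP_HEADERS = {"received"}  # headers left out of the index file (assumption)


def append_message(raw_path, index_path, dbm_path, raw_message):
    """Append one message (bytes) and record its offsets under its Message-ID."""
    msg = email.message_from_bytes(raw_message)
    message_id = msg["Message-ID"]

    # Header-only summary for the "index" file, minus the noisy headers.
    header_lines = [
        "%s: %s" % (name, value)
        for name, value in msg.items()
        if name.lower() not in SKIP_HEADERS
    ]
    header_block = ("\n".join(header_lines) + "\n\n").encode("utf-8")

    with open(raw_path, "ab") as raw, open(index_path, "ab") as idx:
        raw_offset = raw.seek(0, os.SEEK_END)
        idx_offset = idx.seek(0, os.SEEK_END)
        raw.write(raw_message)
        idx.write(header_block)

    # The DBM file maps message-id -> offsets in both files.
    record = [raw_offset, len(raw_message), idx_offset, len(header_block)]
    with dbm.open(dbm_path, "c") as db:
        db[message_id.encode("utf-8")] = json.dumps(record).encode("utf-8")


def fetch_raw(raw_path, dbm_path, message_id):
    """Return the original message bytes for a Message-ID."""
    with dbm.open(dbm_path, "r") as db:
        raw_offset, raw_len, _, _ = json.loads(db[message_id.encode("utf-8")])
    with open(raw_path, "rb") as raw:
        raw.seek(raw_offset)
        return raw.read(raw_len)
```

Appending only touches the ends of the two files plus one DBM key, and the index file stays small enough to grep.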
A relational interface can be implemented on top of any of those,
with various tradeoffs. One reason why I started this discussion
was to discuss that interface. By the way... does anybody have
details of the M$ ODBC API handy?
It might be something like:
    class Table:
        def select(self, query_clauses, sort_keys, target): ...
        def insert(self, fields): ...
        def update(self, clauses, fields): ...
        def delete(self, clauses): ...

    class MessageArchive(Table):
        def addMessage(self, message_stream): ...
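For discussion, one way the interface above might be filled in, using a list of dicts in memory. This is only a sketch under assumptions of mine: clauses are equality tests given as a dict, `target` is a list of field names to return, and field values are strings. A real backend (flat files, DBM, SQL) would sit behind the same methods.

```python
import email


class Table:
    """Hypothetical in-memory table; each row is a plain dict."""

    def __init__(self):
        self.rows = []

    def _matches(self, row, clauses):
        # Assumption: a clause is field == value; all clauses must hold.
        return all(row.get(f) == v for f, v in clauses.items())

    def select(self, query_clauses, sort_keys=(), target=None):
        hits = [r for r in self.rows if self._matches(r, query_clauses)]
        hits.sort(key=lambda r: tuple(r.get(k, "") for k in sort_keys))
        if target is None:
            return [dict(r) for r in hits]
        return [{f: r.get(f) for f in target} for r in hits]

    def insert(self, fields):
        self.rows.append(dict(fields))

    def update(self, clauses, fields):
        count = 0
        for row in self.rows:
            if self._matches(row, clauses):
                row.update(fields)
                count += 1
        return count

    def delete(self, clauses):
        before = len(self.rows)
        self.rows = [r for r in self.rows if not self._matches(r, clauses)]
        return before - len(self.rows)


class MessageArchive(Table):
    def addMessage(self, message_stream):
        # Parse an RFC 822 message from a text stream into one row.
        msg = email.message_from_file(message_stream)
        self.insert({
            "message-id": msg["Message-ID"],
            "from": msg["From"],
            "date": msg["Date"],
            "subject": msg["Subject"],
            "body": "" if msg.is_multipart() else msg.get_payload(),
        })
```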
Then you need the auxiliary tables for building links: references,
in-reply-to, heuristic subject-based threads. Converting message/rfc822
to HTML is an almost completely separate issue.
    def message2html(message_stream, linkbase): ...
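A minimal sketch of what message2html might do, assuming `linkbase` is a URL prefix under which each message is addressed by its URL-quoted message-id (that layout is my assumption, not a settled design). It only handles single-part text messages and turns In-Reply-To and References into links.

```python
import email
import html
import re
from urllib.parse import quote

MSGID = re.compile(r"<([^<>\s]+)>")  # message-ids between angle brackets


def message2html(message_stream, linkbase):
    msg = email.message_from_file(message_stream)
    subject = html.escape(str(msg.get("Subject", "(no subject)")))
    out = [
        "<html><head><title>%s</title></head><body>" % subject,
        "<h1>%s</h1>" % subject,
    ]
    # One link per message-id found in the threading headers.
    for header in ("In-Reply-To", "References"):
        for mid in MSGID.findall(str(msg.get(header, ""))):
            url = linkbase + quote(mid, safe="")
            out.append('<p>%s: <a href="%s">%s</a></p>'
                       % (header, html.escape(url), html.escape(mid)))
    body = "" if msg.is_multipart() else msg.get_payload()
    out.append("<pre>%s</pre>" % html.escape(body))
    out.append("</body></html>")
    return "\n".join(out)
```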
>> 2. Support format negotiation. Make the original message/rfc822 data
>> available as well as the enhanced-with-links html format -- at the
>> same address. This _should_ allow clients to treat the message as a
>> message, i.e. reply to it, etc. by specifying:
>>
>> Accept: message/rfc822
>
>A reasonable request. Will be very useful when clients can process
>MIME data correctly.
That reminds me: we should be sure that the HTML has a link to the original
message, a la:
<link rel=enhancement href="mid:2o3423o4u2o3i4u2o34@foo.com">
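On the format negotiation quoted above: a simplified sketch of how a server might pick between the two representations at one address. The function name and the default list are hypothetical, and the matching ignores the specificity rules a full Accept parser would apply; it just takes the highest q-value, preferring HTML on ties.

```python
AVAILABLE = ("text/html", "message/rfc822")


def choose_media_type(accept_header, available=AVAILABLE):
    """Return the preferred available type for an Accept header, or None."""
    if not accept_header or not accept_header.strip():
        return available[0]  # no preference stated: serve the HTML view
    best, best_q = None, 0.0
    for item in accept_header.split(","):
        fields = [f.strip() for f in item.split(";")]
        media_range = fields[0].lower()
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        for candidate in available:
            main_type = candidate.split("/")[0]
            if media_range in (candidate, main_type + "/*", "*/*") and q > best_q:
                best, best_q = candidate, q
    return best
```

A client that wants to reply to the message would send `Accept: message/rfc822` and get the original; a browser sending `*/*` gets the HTML.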
>As for search engines, those can be hooked in independently; which some
>have done with mhonarc. It is a waste of my time, and probably other
>developers of mail processors, to write search engines when one can
>already utilize well developed ones like Lycos, Glimpse, etc.
For full-text searching, this is true. But I was talking about
relational queries.
>> 4. Allow relational queries: by date, author, subject, message-id,
>> keywords, or any combination. Essentially, treat the archive as a
>> relational database table with fields message-id, from, date, subject,
>> keywords, and body.
>
>This is best done by utilizing an existing database system (eg Oracle),
>and using mhonarc (or other preferred mail->html filter) to convert
>retrieved messages to html on-the-fly.
You can do a pretty good imitation of Oracle with flat files if you
know what the queries are likely to look like. See the msgarchive.py
stuff in the grail sources, for example.
>> Update the index in real-time, as messages arrive, not in batch.
>
>.forward
Right. MHonarc already allows this.
>However, I see many of the tasks can be done by a collection of tools
>and not a single tool.
Whatever. I just stated my requirements. I'm pleased with the discussion
that's followed.
> Trying to develop a single software program to
>do everything may be wasted effort, and it does not make the best use of
>existing software that can do the job better (ie. I'm lazy and do not
>want to reinvent the wheel :-).
Right. Reusable software is good. But unix pipes are not the building
blocks I'm interested in any more. I'm interested in objects, modules,
and interfaces.
> As long as mhonarc can be
>invoked just as a message/rfc822->HTML converter, then others have the
>ability to use that capability in whatever WWW mail archiving system
>that suits their needs.
As I said: I like the way mhonarc does a lot of things. I just don't
like the API to it (unix pipes).
>I'd like to remind people that many of the WWW tools/filters people use
>are developed in various individuals' spare time. As one's problems
>become more sophisticated, one should not hold his/her breath waiting
>for a free, ready-made, solution. Many times it will take the
>integration of several programs to come up with the desired solution
>because free software developers cannot solve everyone's problems. The
>solution to Dan's problem may best be solved by an intelligent
>integration of several programs and not a single program.
The solution might also come from a collaboration between the folks
in this forum, and other forums. I'm hardly waiting for a ready-made
solution. I've spent quite a bit of time poring over the available
tools and starting to develop new ones. But as long as I'm developing
something, I'd like to get ideas from other folks who have been down
this path.
And I'd like to see mhonarc, hypernews, HURL, and such tools converge
and share code.
I'd also like to see it form the basis of a _better_ communications
facility, perhaps built on something like KQML[1].
Dan
[1] http://www.cs.umbc.edu/kqml/