Re: Last-modified date & indexing

Nick Arnett (narnett@verity.com)
Thu, 10 Nov 1994 13:13:27 -0800


At 10:42 AM 11/10/94, Mike Schwartz wrote:

>I think HTTP, FTP, etc., are the wrong places to look to get indexing
>information.

That's fine, but it's a bit difficult to reconcile last-modified dates from
HTTP when they're the only thing that's available for updating an index.

If you only need to update a few documents out of many, it's *more*
efficient to request them from the Web server, even if you used a local
tool to build the index in the first place. A local indexing tool is what
we're building, as I think I said.
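
As a rough illustration of the incremental case, here is a minimal Python
sketch of how an indexer might use HTTP's Last-Modified date to decide which
documents to re-fetch. The function names and the idea of storing an
"indexed_at" timestamp per document are assumptions made for this example,
not a description of any actual product. If-Modified-Since is a standard
HTTP request header that the message itself does not mention; whether a
given server honors it is also an assumption.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
import urllib.request


def build_conditional_request(url, indexed_at):
    """Build (but do not send) a GET that asks for the body only if it changed.

    `indexed_at` is a hypothetical per-document timestamp (timezone-aware,
    UTC) recording when the index last saw this document.
    """
    req = urllib.request.Request(url)
    # HTTP dates are written in GMT, e.g. "Thu, 10 Nov 1994 21:13:27 GMT".
    req.add_header("If-Modified-Since", format_datetime(indexed_at, usegmt=True))
    return req


def needs_reindex(last_modified_header, indexed_at):
    """Return True if the document should be fetched and indexed again.

    With no usable Last-Modified value there is nothing to compare, so this
    sketch falls back to re-indexing rather than risk a stale entry.
    """
    if last_modified_header is None:
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable date: treat as unknown
    if modified.tzinfo is None:
        # Assumption: a date without a zone is taken to be UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > indexed_at
```

In use, the indexer would call needs_reindex() with the Last-Modified value
from a HEAD response, or send the conditional request and skip any document
that comes back 304 Not Modified.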

>In Harvest (http://harvest.cs.colorado.edu/) you can run a Gatherer at the
>archive site, and it builds all this information and exports it very
>efficiently.

I'm quite aware of Harvest. Unfortunately, there are tens of thousands of
HTTP servers that our products have to be able to index. We could ask all
of them to support Harvest's conventions, but I'm not going to wait to ship
a product until they do.

>In contrast, you can pull all of the data across the
>Internet from our server as a single compressed, structured stream in just
>a few minutes. On average, running an archive-site Gatherer causes about
>6,660x less load on the archive's CPU and 50x less network traffic than
>doing remote gathering ala the robots - and this doesn't include the
>savings you get from doing incremental updates (retrievals of the form
>"give me all the content summaries for objects changed/created since date
>XYZ").

So we agree, then, on the efficiency of the approach. I just can't figure
out why you started that paragraph with the words "In contrast."

Nick