And now for something different... multipage scanned images

Larry Masinter (masinter@parc.xerox.com)
Fri, 13 Aug 1993 22:54:10 PDT


I've been puzzling over how to reasonably integrate into WWW and
XMosaic access to multi-page documents that are *only* available as
multi-page images. We (and a number of other organizations) have large
collections of documents that are currently available only on paper.

Some of them are just a few pages; many reports are 40-50 pages, and,
of course, many books are hundreds of pages.

It is reasonable -- at least as reasonable as sending talk radio -- to
think about reading books on a high-resolution screen across the
internet.

The question is, how to build this kind of access into WWW.

The first thing to realize is that it isn't reasonable to transfer the
entire document before starting to read it. That might work for
something that is a page or two, but doesn't work well at all for a
30-40 page report, and works even less well for a 300-page book.

A typical page image scanned at 300 DPI binary and compressed with T.6
(Group 4) compression might be 30-50K bytes.
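
A rough back-of-the-envelope sketch of what those page sizes imply for
fetching a whole document up front. The 30-50K byte range is the figure
above; the link speeds, and reading 'K' as 1000 bytes, are illustrative
assumptions rather than measurements.

```python
# Per-page size range from the message: 30-50K bytes (K taken as 1000 here).
PAGE_BYTES = (30_000, 50_000)

# Assumed link speeds in bits per second, for illustration only.
LINKS = {"14.4k modem": 14_400, "56k line": 56_000, "T1": 1_544_000}


def document_bytes(pages):
    """Return the (low, high) estimate in bytes for a document of `pages` pages."""
    return pages * PAGE_BYTES[0], pages * PAGE_BYTES[1]


def transfer_seconds(pages, page_bytes, bits_per_sec):
    """Seconds to move `pages` page images, ignoring protocol overhead."""
    return pages * page_bytes * 8 / bits_per_sec


for pages in (2, 40, 300):
    low, high = document_bytes(pages)
    print(f"{pages} pages: {low / 1e6:.2f}-{high / 1e6:.2f} MB")
    for name, speed in LINKS.items():
        worst = transfer_seconds(pages, PAGE_BYTES[1], speed)
        print(f"  {name}: up to {worst:.0f} s for the whole document")
```

Under these assumptions a single page is a matter of seconds on most
links, while a 300 page book is 9-15 MB, which is the argument for
fetching pages as they are read.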

A couple of alternatives come to mind:

a) try to squeeze this into HTML+++ somehow.
I've thought of some possibilities here, along the lines of the
current discussion of online 'books' with chapters and indices, but
allowing for inline images to be retrieved as 300DPI compressed
binary (and then decompressed and antialiased to grey by the client).

b) create a new document type application/bookreader, and a specific
application 'bookreader' which knows how to deal with 'books'. It
would be reasonable for a 'book' to consist just of a list of URLs or
URNs, one for each page, and to build a special bookreader that
retrieves the images as you're reading the document online.
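
A minimal sketch of how option (b) might look, assuming a 'book' is
simply a text file with one page URL per line. The names parse_book and
BookReader, the one-URL-per-line layout, and the in-memory cache are all
hypothetical choices for illustration; nothing here is a defined
application/bookreader format.

```python
from urllib.request import urlopen


def parse_book(text):
    """Return the page URLs listed one per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_url(url):
    """Retrieve one page image as raw bytes."""
    with urlopen(url) as response:
        return response.read()


class BookReader:
    """Hypothetical reader that fetches a page image only when it is viewed."""

    def __init__(self, page_urls, fetch=fetch_url):
        self.page_urls = list(page_urls)
        self.fetch = fetch   # injectable, so the reader can be tried offline
        self.cache = {}      # page number -> image bytes already retrieved

    def page(self, number):
        """Return the image bytes for a 1-based page number."""
        if not 1 <= number <= len(self.page_urls):
            raise IndexError(f"no page {number} in this book")
        if number not in self.cache:
            self.cache[number] = self.fetch(self.page_urls[number - 1])
        return self.cache[number]
```

The point of the design is that opening the book costs only the list
itself; each page image is transferred the first time the reader turns
to it, and turning back to a page already seen costs nothing.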

(a) seems hard, and I'm leaning toward (b). (b) has the slight
advantage that it can be shoehorned into systems that currently only
do 'gopher'. Have others thought of this issue? Opinions? Ideas?

-- Larry

p.s. it seems pretty clear that you can't get more concrete with
'presentation' than an image of the page.