PATH_INFO on the cheap?

Robert S. Thau (rst@ai.mit.edu)
Thu, 30 Dec 93 12:07:57 EST


In re the current PATH_INFO discussion, it may be possible to cut the cost
of the current implementations by piggybacking on work which the server (at
least the NCSA server) is already doing anyway.

Here (I think) is the trick. In order to do access control, the server is
already doing stuff like this (from my previous syscall trace):

open ("/.htaccess", 0, 0666) = -1 ENOENT (No such file or directory)
open ("/com/.htaccess", 0, 0666) = -1 ENOENT (No such file or directory)
open ("/com/doc/.htaccess", 0, 0666) = -1 ENOENT (No such file or directory)
open ("/com/doc/web-support/.htaccess", 0, 0666) = -1 ENOENT (No such file or directory)
open ("/com/doc/web-support/cgi-bin/.ht".., 0, 0666) = -1 ENOENT (No such file or directory)

(Hmmm... would a /.htaccess ever be particularly useful? Never mind).

Suppose the pathname it's working on has appended PATH_INFO, say, something
like /usr/webhome/subdirectory/the-script/more/stuff/here. Then we get a
cascade which looks more or less like this:

open ("/.htaccess", 0, 0666) = -1 ENOENT
open ("/usr/.htaccess", 0, 0666) = -1 ENOENT
open ("/usr/webhome/.htaccess", 0, 0666) = -1 ENOENT
open ("/usr/webhome/subdirectory/.htaccess", 0, 0666) = -1 ENOENT
open ("/usr/webhome/subdirectory/the-script/.htaccess", 0, 0666) = -1 ENOTDIR

Note the error codes. The last open fails with ENOTDIR rather than ENOENT,
which means that "the-script" exists but is not a directory. Once the
server has seen ENOTDIR, if it ever does before running out of apparent
directories, it can tell it has found the actual directory containing the
script (and in fact, the script itself, which is not a directory). The rest
of the pathname must be PATH_INFO, since it corresponds to nothing in the
filesystem. The cost for this case (PATH_INFO present) is one extra failed
filesystem lookup with no subsequent access check.
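
To make the split concrete, here is a minimal sketch in Python. It is not
the NCSA server's code: the function name split_path_info and its root
parameter are hypothetical, introduced only so the walk can be tried
against a scratch directory, and it assumes a Unix-like system where
opening a path beneath a regular file fails with ENOTDIR.

```python
import errno
import os


def split_path_info(translated, root="/"):
    """Split a translated pathname into (script, PATH_INFO).

    Hypothetical helper: probes <dir>/.htaccess for each apparent
    directory, as an access-control walk would, and stops at ENOTDIR.
    """
    components = [c for c in translated.split("/") if c]
    # Every proper prefix of the pathname is an apparent directory.
    for i in range(len(components)):
        directory = os.path.join(root, *components[:i])
        try:
            fd = os.open(os.path.join(directory, ".htaccess"), os.O_RDONLY)
        except OSError as e:
            if e.errno == errno.ENOTDIR:
                # "directory" is really a file: the script itself.
                return directory, "/" + "/".join(components[i:])
            if e.errno != errno.ENOENT:
                raise
        else:
            os.close(fd)  # a real server would read the access rules here
    # No ENOTDIR seen: the whole pathname is taken as the file.
    return os.path.join(root, *components), ""
```

A pathname with a missing component simply falls through to the last
line; reporting that as "not found" is left to whatever opens the file.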

If PATH_INFO is absent, on the other hand --- that is, if the submitted URL
after alias translation refers to an actual file, such as

/usr/webhome/subdirectory/the-script

then the last apparent directory in the translated pathname is

/usr/webhome/subdirectory

so the last open call above never happens (why should it?). The entire
pathname corresponds to something in the filesystem, so PATH_INFO must be
null. The cost for this more common case (PATH_INFO absent) is zero ---
the server would be doing all those system calls anyway.
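
The claim that no extra call is needed can be checked by listing the
probes. The sketch below is pure string manipulation with a hypothetical
function name (htaccess_probes); it lists the .htaccess files a walk over
the apparent directories would try, without touching the filesystem and
without stopping at ENOTDIR.

```python
import posixpath


def htaccess_probes(translated):
    """List the .htaccess paths an access-control walk would try to open."""
    components = [c for c in translated.split("/") if c]
    # Proper prefixes only: the last component is never treated as a directory.
    return [posixpath.join("/", *components[:i], ".htaccess")
            for i in range(len(components))]
```

For /usr/webhome/subdirectory/the-script the list ends at
/usr/webhome/subdirectory/.htaccess, so the script itself is never probed.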

Is it inelegant to combine the two directory walks like this? Perhaps.
I'm certainly not about to change the code to do this in order to gain
what I regard as a fairly minor efficiency bum --- but then again, I'm
running the server on the machine which has the disks to avoid network
filesystem overhead. However, I do think it's possible.

One unrelated point, to clear up a misunderstanding. I have *not*
expressed distaste for the redirection mechanism of HTTP/1.0. I like it.
I use it. There are plenty of useful things that you can't do without it.

I have expressed distaste for the specter of server config files that could
ultimately grow to look like:

Redirect ...
Redirect ...
Redirect ...
Redirect ...
Redirect ...
Redirect ...
Redirect ...
...

with one line for every time that I or anybody else here has changed their
mind. I'm not so much bothered by the overhead this puts on the server,
although that's there, as I am by the unmaintainability of the scheme
itself --- even if the redirection information is distributed, rather than
locked up in a single global config file, it means putting bits of history
and loose ends all over the place for people to trip on, which is a
prospect I find quite unattractive.

rst