Potentially serious error in unix server http logging.

Simon E Spero (ses@tipper.oit.unc.edu)
Wed, 16 Nov 94 08:07:42 -0500


Whilst doing a bit of log-file analysis, I've come across what appears to be
a worrying, and possibly serious, error which seems to be common to every
http server for unix.

The problem comes in handling cases where the client aborts a file request by
dropping its end of the connection. This case is sometimes erroneously logged
as successful even when the disconnect is detected; however, for a large class
of cases (including the one I was actually interested in, demmit), the server
never discovers the client disconnection at all.

The reason for this is related to the unix habit of asynchronous I/O. The server
can often complete its write to the output socket without all the data
having been delivered. When the server has finished writing its data to the
output stream, it will usually call close to bring down the connection. If the
linger option has been set, the close will wait for some time before completing;
otherwise the close will succeed immediately, and the data will be delivered
by the operating system in the background. In the latter case, the server
will not be able to detect the cancellation. If the linger option is selected,
implemented, and working, the failure will be reported as an error in the
call to close.
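
A minimal sketch of the linger case described above, assuming a unix-style
struct linger made of two C ints. The helper name close_with_linger is
hypothetical and is not taken from any particular server. Whether a failed
or timed-out linger actually surfaces as an error from close varies between
operating systems and runtimes, so treat the return value as a hint rather
than a guarantee.

```python
import socket
import struct


def close_with_linger(sock, timeout_s):
    """Close sock with SO_LINGER on; return True if close() reported no error.

    Hypothetical helper for illustration only.
    """
    # Assumption: struct linger is (int l_onoff, int l_linger), as on the
    # usual unix systems. The layout differs on some other platforms.
    linger = struct.pack("ii", 1, timeout_s)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
    try:
        sock.close()  # may block for up to timeout_s seconds
    except OSError:
        return False  # a failure was reported at close time
    return True
```

A server using something like this would log the transfer as complete only
when the helper returns True, instead of logging success as soon as the
last write returns.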

This problem only occurs if the server finishes writing data to the socket
before the cancellation is received. The amount of data remaining to be sent
at the point at which this occurs depends on the size of the socket buffer.
A typical size for this seems to be around 8K.
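
The buffering effect can be seen with a short sketch. It assumes a local
socket pair as a stand-in for a real client connection, and both function
names are hypothetical. The first function asks the kernel for the send
buffer size; the second writes to a peer that never reads and reports how
many bytes the write accepted anyway.

```python
import socket


def send_buffer_size(sock):
    """Return the kernel's send buffer size for sock, in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def bytes_accepted_without_reader(payload_size):
    """Write payload_size bytes to a peer that never reads.

    Returns the number of bytes the kernel accepted from the writer.
    """
    # Assumption: a local socket pair stands in for a server/client link.
    writer, reader = socket.socketpair()
    try:
        writer.setblocking(False)  # never hang if the buffer is full
        try:
            return writer.send(b"x" * payload_size)
        except BlockingIOError:
            return 0  # nothing fitted into the buffer
    finally:
        writer.close()
        reader.close()
```

A payload smaller than the buffer is accepted in full even though nothing
has been read at the other end, which is why a response that fits in the
buffer looks successful to the server whatever the client does next.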

The reason I got caught up in this mess is that I was trying to estimate the
effect of parallel retrievals on the number of cancelled transactions. This is
interesting, because as the number of concurrent fetches increases, so too
does the number of concurrent cancellations. Also, with browsers such as
IBM's explorer and Netscape's Netscape, transaction cancellation becomes an
active part of the operational paradigm, rather than an exceptional condition
occurring relatively infrequently. Unfortunately, the object types most affected
by this change in access patterns are those used for icons - which are also the
types most likely to be missed by the logs.

Simon
p.s.
Time for Hackaholics Anonymous.
Hi. My name is Simon, and I don't check the error codes from close.