Re: mystery NCSA httpd problems on gnn.com

Rob McCool (robm@neon.mcom.com)
Tue, 31 Jan 1995 00:48:49 +0100


/*
* "Re: mystery NCSA httpd problems on gnn.com" by Robert S. Thau
* written Mon, 30 Jan 95 14:02:38 EST
*
* I believe I've seen the same sort of thing (incoming connections
* timing out, no new connections being logged, CPU and disks dead,
* main server process shows up as blocked in accept() if I gcore(1)
* it and do a backtrace). Note that this doesn't seem to be entirely
* consistent with the "accept queue backup" story --- if the accept()
* queue on the socket is full to bursting, why doesn't the server
* accept new connections?

Because the same queue holds both connections that are ready to be
accepted and half-negotiated connections. The latter can fill the
queue, so no new connection can finish negotiating, and accept() has
nothing to return.
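
To make that concrete, here is a toy model of the behaviour, not
kernel code. It assumes a single queue whose limit counts half-open
and ready connections together, as described above; real kernels may
do the accounting differently. The ListenQueue class, its method
names and the client labels are all made up for illustration.

```python
from collections import deque


class ListenQueue:
    """Toy model: one queue shared by half-open and ready connections."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.half_open = deque()  # SYN seen, handshake not finished
        self.ready = deque()      # handshake done, waiting for accept()

    def syn(self, client):
        """A connection attempt arrives; it is dropped if the queue is full."""
        if len(self.half_open) + len(self.ready) >= self.backlog:
            return False
        self.half_open.append(client)
        return True

    def handshake_done(self, client):
        """The client's final ACK arrived; the connection can be accepted."""
        self.half_open.remove(client)
        self.ready.append(client)

    def accept(self):
        """Return a ready connection, or None where the real call would block."""
        return self.ready.popleft() if self.ready else None


q = ListenQueue(backlog=5)
# Five clients behind a dead route: their handshakes never complete.
stuck = [q.syn("unreachable-%d" % i) for i in range(5)]
# A client on a healthy route now finds no room in the queue.
healthy_admitted = q.syn("healthy")
# Nothing is ready, so the server would sit blocked in accept().
accepted = q.accept()
```

In this model the queue is full and accept() still has nothing to
hand back, which matches a server that looks idle while clients time
out.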

* One other piece of puzzling evidence --- intense bursts of
* connections don't always provoke the bug. I try to keep track of
* peak load here by logging a histogram of transactions/sec
* vs. number-of-seconds. We routinely log bursts of >10
* transactions/sec a few times a day even on weekends, when this sort
* of "freeze-up" behavior doesn't seem to have been a problem.

We've always been able to track it down to a line being down. When the
watchdogs report a server not responding (both of them invariably do
it at the same time BTW even though they're on different outbound
lines), my first step is to look for a down route. I ping each host on
a list of 10-20, and usually by the second or third one I hit a
failure. Traceroute can then generally find the down route.
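
That check could be scripted roughly as follows. This is only a
sketch: it assumes a Unix-style ping that takes "-c 1" to send a
single packet and exits non-zero on failure, and the function names
ping_once and first_unreachable are invented here. The host list is
whatever you keep for this purpose.

```python
import subprocess


def ping_once(host):
    """Send one echo request; True if the ping command reports success."""
    result = subprocess.run(
        ["ping", "-c", "1", host],  # "-c 1" is assumed (Unix-style ping)
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def first_unreachable(hosts, probe=ping_once):
    """Return the first host that fails the probe, or None if all answer."""
    for host in hosts:
        if not probe(host):
            return host
    return None
```

Calling first_unreachable() with your own list of hosts returns the
first one that does not answer, which is the one to hand to
traceroute.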

* Incidentally, killing off the server process and restarting it
* always gets things moving again (at least it does here), so that
* action seems to clear whatever inside the kernel is causing the
* bottleneck.

Yes, because the socket listening to port 80 is closed and then
re-opened with a fresh queue.

* That hack seems to have helped matters, but I'm not sure that it's
* gotten rid of the freeze-ups entirely --- I spotted something which
* looked an awful lot like the same old freeze on Friday, although
* this time the process was waiting in select(). If the bug keeps on
* showing up at an annoying rate, the next thing I'll try is closing
* and reopening the socket if no connection requests have come in for
* ten seconds or so, but that seems a little drastic.
*/

You have to be careful to prevent race conditions there. Between the
close and the re-open nothing is listening on the port, so there's a
chance people could get "connection refused" if they hit your server
at just the right time.
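
A small sketch of that window, run against the loopback interface on
a port picked by the OS rather than port 80. The helper names
listen_on and try_connect are made up, and this only illustrates the
gap; it is not how any particular server handles the re-open.

```python
import socket


def listen_on(port, host="127.0.0.1", backlog=5):
    """Open a listening TCP socket; port 0 lets the OS pick a free port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(backlog)
    return s


def try_connect(port, host="127.0.0.1"):
    """Return True if a TCP connect succeeds, False if it is refused."""
    try:
        socket.create_connection((host, port), timeout=2).close()
        return True
    except ConnectionRefusedError:
        return False


old = listen_on(0)
port = old.getsockname()[1]
before = try_connect(port)  # old listener is up
old.close()
during = try_connect(port)  # nothing is listening: this is the gap
new = listen_on(port)       # fresh socket, fresh queue
after = try_connect(port)
new.close()
```

A client arriving between old.close() and the new listen() is the
one that gets refused, so the shorter that gap, the better.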

--Rob