Subject: problem with Squid losing socket file descriptors (NetBSD-1.3.3)
To: NetBSD Networking Technical Discussion List <tech-net@NetBSD.ORG>
From: Greg A. Woods <woods@most.weird.com>
List: tech-net
Date: 10/23/1999 00:39:45
I've been having recurring problems with a transparent squid
installation that keeps crashing in various ways (usually squid goes
wild writing the same "too many open files" error message to its log
files over and over until that partition fills up).  I've finally
figured out that it's apparently not Squid's fault that it's slowly
leaking file descriptors.  In fact it's apparently the kernel's leak!

The symptoms are that squid reports EFAULT from accept():

1999/10/22 23:07:15| comm_accept: FD 45: (14) Bad address
1999/10/22 23:07:15| httpAccept: FD 45: accept failure: (14) Bad address

At this point a socket file descriptor seems to go off into never-never
land and is unusable from then on.  "lsof-4.39" then reveals entries
such as the following:

COMMAND PID  USER   FD   TYPE     DEVICE  SIZE/OFF    NODE NAME
squid   257 squid   68u  inet                  0t0     TCP can't read inpcb at 0x00000000 

(I occasionally see similar reports for FD# 0 & 1 for xterms on my 1.3.3
test machine too, but so far never anywhere else.)

Something also goes wrong in the kernel, as we eventually get various
seemingly random panics if squid is regularly restarted to try to work
around this problem.  The box never survives more than a week, and
under the current load conditions squid doesn't run much more than 24
hours (it handles upwards of 800 requests per minute on average).
Squid itself also seems to exhibit these symptoms and other weird
behaviour far more often if it's restarted without a reboot.

The folks on <squid-bugs@ircache.net> have suggested:

    The most likely explanation is that the version of the OS kernel you
    use has a busted implementation of accept(), leaking filedescriptors
    on some kind of internal error (probably fails to handle aborted
    connects or something similar).

and I would tend to agree, given what I've been seeing and the similar
bad history I've had with various other implementations.
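
To make that suspicion concrete, here's roughly the sort of error path
I imagine they mean.  This is only my own simplified sketch with
made-up helper names (fd_alloc(), sock_accept(), and so on), not the
actual code in sys/kern/uipc_syscalls.c:

/*
 * Simplified sketch of a buggy accept()-style error path.  All of the
 * helpers here (fd_alloc(), sock_accept(), copyout_peername(),
 * sock_discard(), fd_release()) are made-up stand-ins, not the real
 * kernel interfaces.
 */
struct socket;

extern int  fd_alloc(struct socket **sop, int *fdp);	/* new fd + socket */
extern int  sock_accept(struct socket *head, struct socket *so);
extern int  copyout_peername(struct socket *so, void *uaddr, int *ulenp);
extern void sock_discard(struct socket *so);
extern void fd_release(int fd);

int
buggy_accept(struct socket *head, void *uaddr, int *ulenp, int *retfd)
{
	struct socket *so;
	int fd, error;

	if ((error = fd_alloc(&so, &fd)) != 0)
		return (error);

	if ((error = sock_accept(head, so)) != 0) {
		/* this path cleans up properly... */
		sock_discard(so);
		fd_release(fd);
		return (error);
	}

	if ((error = copyout_peername(so, uaddr, ulenp)) != 0) {
		/*
		 * BUG: bailing out with EFAULT here without undoing the
		 * fd_alloc() leaves the descriptor slot consumed in the
		 * process's table, but since accept() returns -1 the
		 * application never learns the descriptor number and so
		 * can never close() it, which would match the unusable
		 * "can't read inpcb" entries lsof shows.
		 */
		return (error);
	}

	*retfd = fd;
	return (0);
}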

In digging further I noticed some promising changes to
sys/kern/uipc_syscalls.c with the following commit message:

revision 1.43
date: 1999/05/05 20:01:09;  author: thorpej;  state: Exp;  lines: +95 -29
Add "use counting" to file entries.  When closing a file, and it's reference
count is 0, wait for use count to drain before finishing the close.

This is necessary in order for multiple processes to safely share file
descriptor tables.
----------------------------
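
As best I can tell from that log message, the idea is that a file
entry carries both a reference count and a separate "use" count held
across operations on it, and close() waits for the use count to drain
before tearing the entry down.  A rough userland sketch of that idea
(the names here are my own guesses, not the real struct file
machinery):

/*
 * Userland sketch of the "use counting" idea from that log message:
 * a reference count for fd-table references plus a separate use count
 * held across operations, with close waiting for the use count to
 * drain.  Names are mine, not the kernel's.
 */
#include <pthread.h>

struct xfile {
	pthread_mutex_t	xf_lock;
	pthread_cond_t	xf_drained;
	int		xf_refcount;	/* fd-table references */
	int		xf_usecount;	/* operations currently in progress */
};

void
xfile_use(struct xfile *fp)
{
	pthread_mutex_lock(&fp->xf_lock);
	fp->xf_usecount++;
	pthread_mutex_unlock(&fp->xf_lock);
}

void
xfile_unuse(struct xfile *fp)
{
	pthread_mutex_lock(&fp->xf_lock);
	if (--fp->xf_usecount == 0)
		pthread_cond_broadcast(&fp->xf_drained);
	pthread_mutex_unlock(&fp->xf_lock);
}

void
xfile_close(struct xfile *fp)
{
	pthread_mutex_lock(&fp->xf_lock);
	if (--fp->xf_refcount == 0) {
		/* last reference: wait for in-progress operations to finish */
		while (fp->xf_usecount > 0)
			pthread_cond_wait(&fp->xf_drained, &fp->xf_lock);
		/* ...safe to tear the file entry down here... */
	}
	pthread_mutex_unlock(&fp->xf_lock);
}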

How likely is it that these changes will fix the problem I'm seeing?

If not, are there other fixes in -current that would fix this problem?

How self-contained are any such changes (i.e. can I pull them up to my
1.3.3 source tree and give them a try)?

If there are fixes in -current that do work, what's the chance of
getting them pulled up for 1.4.2 if they're not already?  :-)

FYI, searches of the PR database don't reveal anything similar as far as
I can tell.  Should I file a PR based on the above?

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods@acm.org>      <robohack!woods>
Planix, Inc. <woods@planix.com>; Secrets of the Weird <woods@weird.com>