current-users: kernel resource leak? (NetBSD-current from February)

Subject: kernel resource leak? (NetBSD-current from February)
To: None <current-users@netbsd.org>
From: Jarle Greipsland <jarle@idt.unit.no>
List: current-users
Date: 10/13/1994 21:07:14
Hi, let me start off by declaring that this is not a bug report.  It's
more of an inquiry to see if someone else has detected the behaviour
described below, and may know something about what causes it. So, if
someone (whoever that might be) finds that this post is lacking in detail,
either press 'd' for delete or email me for more details.

Okay, this is old stuff and I should probably have reported it a long time
ago.  Sorry.  I'm partly responsible for running a fileserver for a bunch
of PCs, mainly OS/2 machines.

The fileserver:
OS: NetBSD-current from late February 94 (I told you it was old! And don't
    bother with why I installed -current, okay :-)
    Some options from config 'NFSSERVER, NFSCLIENT, GATEWAY, SCSI, DDB' ++
    (more available upon request)

HW: i486-66, 256kb cache, EISA bus, 3 WD8013EBT boards, aha1742, 
    2 Quantum PD1800S (1800 MB SCSI 2), 1 Maxtor XT-8702S (578MB SCSI 1), 
    1 Quantum PD1225S (1200 MB SCSI 2). "No-brand" VGA board.
    2 16550 uarts, 1 lpt0 printer port

System setup: 
   serving approx 30 OS/2 boxes on one segment (NFS, BOOTP)
   serving approx 20 OS/2, DOS or Windoze boxes on another segment (NFS,BOOTP)
   (serving == exports read only several directories with software packages)
   hooked up to 'the world' (cisco router, no clients) on the third segment.
   maildrop for < 10 people.
   hp laserjet on printer port.
   28.8 uCom modem on one of the serial ports
   runs: 4 nfsd, gated, xntpd, sendmail, lpd + standard stuff
   
Problem: Over a week's time or so, sometimes more, sometimes less, the load
(as reported by w and uptime) gradually picks up, unnoticeably at first,
then at a more rapid rate.  Whenever no external activity takes place it
drops to approx 0, but as soon as an nfs-daemon or printer filter (or
whatever) gets work to do the load increases rapidly.  The funny thing is
that almost no cpu time is spent in user mode, the system seems to devour
cpu cycles for its internal use at a terrifying rate.  My hunch is that
there is a resource leak somewhere in there, and that this resource
eventually gets scarce enough that the processes have to really compete for
it.  This can explain why all processes, even telnets and shells, start to
spend extra time in the system cpu state (as soon as they get active).

I suspect a memory leak, but I don't know for sure.  The only thing I see
that I find a bit odd, but don't have enough knowledge about to interpret
properly, is the output from 'vmstat -m' just before we rebooted it.  Down
this list I see:

Memory statistics by bucket size
    Size   In Use   Free   Requests  HighWater  Couldfree
      16     1147    901    3299684       1280          0
      32      533    235     648778        640          0
      64     2819    317    1730772        320        215
     128      195    285   69949362        160    5609735
     256      177     79     206599         80        865
     512     1362      6      10615         40          0
    1024       23      5    3899612         20          0

What does the couldfree imply?  Does a high number in the couldfree column
signify a problem or is it just an 'interesting tidbit'?

Also, it seems that it's the vnodes that really gobble memory.
      vnodes   1379     678K      682K  3687K    10527      0         0
but it's the mbufs that have the highest request count
        mbuf      8       2K       21K  3687K 69533027      0         0

The rest of the 'stats by type' mostly say below 10K, with a few above 10K,
but all below 100K.  The memory totals say:
Memory Totals:  In Use    Free   Wasted   Requests
                 1042K    129K      16K   79989910

Is this the way it ought to be?  If not, can this be caused by the OS/2
machines mounting NetBSD partitions and never unmounting them?  (I suspect
that client activity triggers the described behaviour, because during
summer break, when no, or just a few, students were using the PCs, the system
behaved impeccably.)

So, I guess my question really is: Has anyone else seen this behaviour?
Anyone know what may cause it?  Is it not a memory leak, but something
completely different? If this triggers someone's long-term memory, is it
fixed in 1.0(beta)?

We're planning to upgrade to 1.0 as soon as it becomes available, so that
may solve our problem (But we may have to look for alternatives if it
doesn't.  That's why I would like to know.)  Anyway, don't waste any time
on this one unless it rings a bell fairly immediately.  That includes you,
mycroft :-)

					-jarle

PS. The phase of the moon doesn't seem to have any influence.  Just thought
some of you might like to know..... DS.
----
"This terminal is no more. It has ceased to be. It's expired and
 gone to meet its maker. This is a late terminal. It's a stiff.
 Bereft of life, it rests in peace. If you hadn't nailed it to the
 bench, it would be pushing up the daisies.  It's run down the
 curtain and joined the choir invisible.  This is an X-Terminal!"
                                                - Unknown