Subject: Re: Rash of "mb_map full" errors
To: None <port-sparc@NetBSD.ORG>
From: Jim Reid <jim@mpn.cp.philips.com>
List: port-sparc
Date: 05/20/1997 10:25:07
>>>>> "Greg" == Greg Earle <earle@isolar.Tujunga.CA.US> writes:


    Greg> My 1.2.1 SS20 has been getting a rash of these suckers as of
    Greg> late, and sure enough, "netstat -m" says why:

    Greg> % netstat -m
    Greg> 1174 mbufs in use:
    Greg> 1098 mbufs allocated to data
    Greg> 63 mbufs allocated to packet headers
    Greg> 13 mbufs allocated to socket names and addresses
    Greg> 511/512 mapped pages in use
    Greg> 1170 Kbytes allocated to network (99% in use)
    Greg> 0 requests for memory denied
    Greg> 0 requests for memory delayed
    Greg> 18770685 calls to protocol drain routines

    Greg> and I can't do an NFS mount to save my life at this point.

Because there are no mbufs to hold the networking data structures -
PCBs - and NFS request and reply data.

    Greg> Is there some knob I can drop into my kernel config file to
    Greg> get me more mapped pages for mbufs to use?

This is not the answer. (Or even the question.... :-)

There is Something Seriously wrong: probably a kernel bug. mbufs are
being allocated for data and then not being released: the buffers are
not getting handed back to the buffer pool once the application is
finished with them. Usually this is caused by an OS bug - a device
driver or protocol code forgetting to call mfree, or a logic error
that causes the call to be skipped. It is possible for abnormal
applications to consume all the mbufs - say if lots of simultaneous
TCP sessions used large windows and the far end never ack'ed anything
- but this is a highly unlikely scenario.
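
To make the "call getting skipped" case concrete, here is a toy model
in Python. It is not NetBSD code: BufferPool, handle_packet and
packets_until_exhaustion are names made up for illustration, and the
"bad checksum" error path is just an assumed example of where a free
might be forgotten.

```python
class BufferPool:
    """Toy stand-in for a fixed-size mbuf pool (hypothetical, not kernel code)."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def alloc(self):
        # Return True on success, False when the pool is exhausted.
        if self.in_use >= self.size:
            return False
        self.in_use += 1
        return True

    def free(self):
        self.in_use -= 1


def handle_packet(pool, bad_checksum):
    """Process one packet; the error path contains the deliberate bug."""
    if not pool.alloc():
        return False
    if bad_checksum:
        return True  # BUG: early return skips pool.free(), leaking one buffer
    pool.free()
    return True


def packets_until_exhaustion(pool_size, bad_every):
    """Count packets handled before an allocation fails."""
    pool = BufferPool(pool_size)
    handled = 0
    while handle_packet(pool, bad_checksum=(handled % bad_every == 0)):
        handled += 1
    return handled
```

Even if only one packet in a hundred takes the buggy path, the pool
still runs dry eventually; nothing ever gives those buffers back.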

The fact that you've had nearly 19 million calls to the protocol
drain routines points the finger at the kernel. These get called when
the mbuf pool is depleted to make some buffers available. [My machine
has been up for ~3 months and there have been no calls to the drain
routines.] Since you've had so many calls, it is clear your OS has a
problem.
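
If you want to watch these two numbers over time, something like the
following Python sketch would pull them out of the "netstat -m" text
quoted above. The regular expressions assume exactly the wording shown
in this thread; other releases or systems may word the output
differently. The function name summarize is my own invention.

```python
import re

# The netstat -m output quoted earlier in this thread.
SAMPLE = """\
1174 mbufs in use:
1098 mbufs allocated to data
63 mbufs allocated to packet headers
13 mbufs allocated to socket names and addresses
511/512 mapped pages in use
1170 Kbytes allocated to network (99% in use)
0 requests for memory denied
0 requests for memory delayed
18770685 calls to protocol drain routines
"""


def summarize(text):
    """Extract mapped-page usage and the drain-call count from netstat -m text."""
    pages = re.search(r"(\d+)/(\d+) mapped pages in use", text)
    drains = re.search(r"(\d+) calls to protocol drain routines", text)
    if pages is None or drains is None:
        raise ValueError("unrecognized netstat -m output")
    used, total = int(pages.group(1)), int(pages.group(2))
    return {
        "pages_used": used,
        "pages_total": total,
        "page_utilization": used / total,
        "drain_calls": int(drains.group(1)),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

A drain count that keeps climbing while the mapped pages stay pinned
at the limit is the pattern described above.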

Increasing the mbuf pool is unlikely to do you any good. It just means
that more memory can be guzzled by the mbuf eating code, which brings
you back to where we started. Also, as these buffers are in the
kernel, you deplete the available RAM for applications which makes
thrashing more likely.
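
As a back-of-the-envelope illustration, assume the leak is steady. The
rate of 8 pages an hour below is invented purely for the arithmetic;
it is not measured from any real system, and hours_until_exhaustion is
a hypothetical helper.

```python
def hours_until_exhaustion(free_pages, leaked_pages_per_hour):
    """Hours before a steady leak uses up the given number of free pages."""
    if leaked_pages_per_hour <= 0:
        raise ValueError("no leak, so the pool never runs out")
    return free_pages / leaked_pages_per_hour


LEAK_RATE = 8  # hypothetical pages leaked per hour, invented for illustration

for pool_pages in (512, 1024, 2048):
    hours = hours_until_exhaustion(pool_pages, LEAK_RATE)
    print(pool_pages, "pages ->", hours, "hours")
```

Doubling the pool only doubles the time before it fills up; the end
state is the same.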

I suggest that you run ps and look for processes in network busy wait
state. These may identify the applications which are at fault, if
there are any. Don't forget however that these processes could be in
that state because they're waiting for network buffers to become free,
so a snap analysis could be misleading. If you don't find any such
processes - how many applications on your system are likely to use a
long fat pipe? - then the problem lies in your kernel. I'd stake a
beer on that being the location of the trouble. If so, increasing the
mbuf pool just means it will take a little bit longer for all the
buffers to get used up.

So, the real answer will be to find and fix the mbuf leak.