port-macppc: Re: Ungraceful low memory issue

Subject: Re: Ungraceful low memory issue
To: John Klos <john@ziaspace.com>
From: Chuck Silvers <chuq@chuq.com>
List: port-macppc
Date: 08/15/2004 09:02:38

hi,

On Wed, Aug 11, 2004 at 04:28:33AM -0400, John Klos wrote:
> Hi,
> 
> I suppose I should create a PR about this. It seems that with 2 gigs of 
> memory in use and only a little into swap, the kernel loses its ability to 
> allocate memory. This is the third time I've crashed this machine while 
> trying to stress test it. NetBSD 2.0.
> 
> sd2(esiop0:0:2:0): unable to allocate scsipi_xfer
> raid0: IO Error.  Marking /dev/sd2a as failed.
> sd2: not queued, error 12
> sd2(esiop0:0:2:0): unable to allocate scsipi_xfer
> sd2: not queued, error 12
> sd1(esiop0:0:1:0): unable to allocate scsipi_xfer

I've talked about this "unable to allocate scsipi_xfer" stuff before...
if we can't allocate an xfer, we should put the buf back on the bufq
and try to start the i/o again later, instead of failing the i/o with
ENOMEM (the "error 12" above).  we should also set a low-water-mark
on the scsipi_xfer pool so that there will always be some descriptors
available, to guarantee forward progress.  that would still leave the
issue that one device can starve other devices, but that probably
won't be a big problem in practice, since we'll usually be able to
allocate more pages to the pool soon after we get into this state.

the same issue also exists in some of the HBA drivers (e.g. ncr53c9x).
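
a rough userland sketch of the requeue-plus-reserve idea described above.
this is not scsipi code: every name here (xfer_pool, bufq, dev_start,
dev_done, system_has_memory) is made up for illustration, and the
"pool" is just a set of counters.  the point is only that a failed
descriptor allocation puts the request back at the head of the queue,
and that a small reserve lets the queue drain even when no new memory
can be had.

```c
#include <assert.h>
#include <stdbool.h>

#define BUFQ_MAX 8

/* hypothetical stand-in for a descriptor pool with a low-water mark */
struct xfer_pool {
    int  cached;            /* descriptors held in reserve */
    int  lowat;             /* refill the reserve up to this many */
    int  nout;              /* descriptors currently handed out */
    bool system_has_memory; /* simulates whether new pages are available */
};

/* hypothetical FIFO of pending request ids (stand-in for a bufq) */
struct bufq {
    int ids[BUFQ_MAX];
    int head;
    int count;
};

void pool_init(struct xfer_pool *p, int lowat)
{
    p->cached = lowat;
    p->lowat = lowat;
    p->nout = 0;
    p->system_has_memory = true;
}

/* take from the reserve first; otherwise succeed only if memory exists */
bool xfer_get(struct xfer_pool *p)
{
    if (p->cached > 0)
        p->cached--;
    else if (!p->system_has_memory)
        return false;
    p->nout++;
    return true;
}

/* return a descriptor, topping the reserve back up to the low-water mark */
void xfer_put(struct xfer_pool *p)
{
    assert(p->nout > 0);
    p->nout--;
    if (p->cached < p->lowat)
        p->cached++;
}

bool bufq_put_tail(struct bufq *q, int id)
{
    if (q->count == BUFQ_MAX)
        return false;
    q->ids[(q->head + q->count) % BUFQ_MAX] = id;
    q->count++;
    return true;
}

/* put a request back at the front so ordering is preserved */
void bufq_put_head(struct bufq *q, int id)
{
    assert(q->count < BUFQ_MAX);
    q->head = (q->head + BUFQ_MAX - 1) % BUFQ_MAX;
    q->ids[q->head] = id;
    q->count++;
}

int bufq_get(struct bufq *q)
{
    assert(q->count > 0);
    int id = q->ids[q->head];
    q->head = (q->head + 1) % BUFQ_MAX;
    q->count--;
    return id;
}

/* start as many queued requests as we can get descriptors for */
int dev_start(struct bufq *q, struct xfer_pool *p)
{
    int started = 0;

    while (q->count > 0) {
        int id = bufq_get(q);
        if (!xfer_get(p)) {
            bufq_put_head(q, id); /* requeue instead of failing the i/o */
            break;
        }
        started++;
    }
    return started;
}

/* completion path: free the descriptor, then retry the queue */
int dev_done(struct bufq *q, struct xfer_pool *p)
{
    xfer_put(p);
    return dev_start(q, p);
}
```

with a reserve of one descriptor and system_has_memory set to false,
three queued requests go out one at a time as each completion calls
dev_done(), and none of them is failed.  this sketch does nothing
about one device starving another.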

> raid0: IO Error.  Marking /dev/sd1a as failed.
> sd1: not queued, error 12
> raid0: failed to create a dag. Too many component failures. 
> ...
> pool rf_daglist_pl: putting with none out
> panic: pool_put
> Begin traceback...
> 0xd5e7bdf0: at pool_do_put+0x2b0
> 0xd5e7be40: at rf_FreeDAGList+0x18
> 0xd5e7be50: at rf_FreeRaidAccDesc+0x34
> 0xd5e7be70: at rf_State_LastState+0x80
> 0xd5e7be90: at rf_ContinueRaidAccess+0xf0
> 0xd5e7beb0: at rf_ContinueDagAccess+0x158
> 0xd5e7bf00: at DAGExecutionThread+0x158
> 0xd5e7bf40: at cpu_switchto+0x44
> 0xd5e7bf50: at ADBDevTable+0x73bc8
> End traceback...

the pool panic is a separate bug; I'd guess the raidframe code is
freeing an object twice in an error case.
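
for illustration, a toy version of the kind of check that a message like
"putting with none out" suggests: the pool counts outstanding items and
complains when something is returned while that count is zero.  this is
an assumption-laden sketch, not the kernel's pool code or raidframe;
toy_pool, toy_get, toy_put and the per-item in_use flag are all
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ITEMS 4

struct toy_item {
    bool in_use;
};

struct toy_pool {
    struct toy_item items[TOY_ITEMS];
    unsigned nout;              /* items currently handed out */
};

enum toy_put_result {
    TOY_PUT_OK,
    TOY_PUT_NONE_OUT,           /* returned while nothing is outstanding */
    TOY_PUT_NOT_IN_USE          /* this particular item was already free */
};

/* hand out the first free item, or NULL if all are in use */
struct toy_item *toy_get(struct toy_pool *pp)
{
    for (size_t i = 0; i < TOY_ITEMS; i++) {
        if (!pp->items[i].in_use) {
            pp->items[i].in_use = true;
            pp->nout++;
            return &pp->items[i];
        }
    }
    return NULL;
}

/* return an item, reporting a double free instead of panicking */
enum toy_put_result toy_put(struct toy_pool *pp, struct toy_item *it)
{
    if (pp->nout == 0)
        return TOY_PUT_NONE_OUT;
    if (!it->in_use)
        return TOY_PUT_NOT_IN_USE;
    it->in_use = false;
    pp->nout--;
    return TOY_PUT_OK;
}
```

the counter alone only catches a double free when nothing else is
outstanding at the time; the per-item flag in this sketch catches the
other case, where the same object is returned twice while other items
are still out.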

-Chuck