current-users: Re: Ungraceful low memory issue

Subject: Re: Ungraceful low memory issue
To: None <john@ziaspace.com>
From: Havard Eidnes <he@netbsd.org>
List: current-users
Date: 08/14/2004 00:07:01
> I suppose I should create a PR about this. It seems that with 2 gigs of
> memory in use and only a little into swap, the kernel loses its ability to
> allocate memory. This is the third time I've crashed this machine while
> trying to stress test it. NetBSD 2.0.
>
> sd2(esiop0:0:2:0): unable to allocate scsipi_xfer
> raid0: IO Error.  Marking /dev/sd2a as failed.
> sd2: not queued, error 12
> sd2(esiop0:0:2:0): unable to allocate scsipi_xfer
> sd2: not queued, error 12
> sd1(esiop0:0:1:0): unable to allocate scsipi_xfer
> raid0: IO Error.  Marking /dev/sd1a as failed.
> sd1: not queued, error 12
> raid0: failed to create a dag. Too many component failures.
> ...

This sounds fairly similar to the problem I reported in PR#25670,
where a PC with 1GB RAM will occasionally spew messages like

May 22 00:43:52 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 00:43:54 dolly /netbsd: sd3: not queued, error 12
May 22 00:43:54 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 00:43:54 dolly /netbsd: sd3: not queued, error 12
May 22 01:03:09 dolly /netbsd: sd3(mpt0:0:1:0): unable to allocate scsipi_xfer
May 22 01:03:11 dolly /netbsd: sd3: not queued, error 12

when a little stressed by I/O.  This also sounds similar to the
problems people are reporting about various SPARC systems (MP or UP).

Therefore, I smell a machine-independent kernel bug in this area.  I
tried to look into the conditions under which the pool allocator (which
is used to allocate scsipi_xfer structs) would return "no memory
available", but being somewhat unfamiliar with the surrounding
environment I was unable to pinpoint what exactly the problem was.

If I recall correctly, scsipi_xfer allocation is done with "nowait" set,
and I would not be surprised if it turns out that physical (?) memory
resources are completely depleted when this problem occurs (e.g. as
caused by the dynamically growing file system and vnode cache), and the
caller said "don't wait", so it'll return "sorry, nothing available",
triggering the above symptoms.  I'm sure this theory could be verified
by instrumenting the pool allocator and/or the place(s) where
scsipi_xfer structs are allocated, e.g. by printing "free physical
pages" at that point.
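
To make the suspected failure mode concrete, here is a small user-space
model in C.  It is only a sketch of the theory above, not NetBSD kernel
code: toy_pool, toy_pool_get and all field names are hypothetical, and
the "cache_pages" field merely stands in for memory held by the file
system and vnode cache.  The value 12 is assumed to correspond to the
"error 12" (ENOMEM) seen in the logs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model only; none of these names are NetBSD kernel APIs. */
enum { TOY_ENOMEM = 12 };       /* assumed to match "error 12" in the logs */

struct toy_pool {
    size_t free_items;          /* items ready to be handed out */
    size_t free_pages;          /* free physical pages available to the pool */
    size_t cache_pages;         /* pages held by caches, reclaimable only by waiting */
    size_t items_per_page;      /* items gained per page added to the pool */
    unsigned long nowait_failures;  /* instrumentation counter */
};

/*
 * Take one item from the pool.
 * Returns 0 on success, or TOY_ENOMEM when no memory can be obtained
 * under the caller's constraints.
 */
int toy_pool_get(struct toy_pool *p, bool can_wait)
{
    assert(p != NULL);

    if (p->free_items == 0) {
        if (p->free_pages == 0) {
            if (!can_wait) {
                /* The instrumentation point: report free pages on failure. */
                p->nowait_failures++;
                fprintf(stderr,
                    "toy_pool_get: nowait failure, free pages = %zu\n",
                    p->free_pages);
                return TOY_ENOMEM;
            }
            if (p->cache_pages == 0) {
                /* A real waiter would sleep; the model just gives up. */
                return TOY_ENOMEM;
            }
            /* Waiting lets a cache page be reclaimed. */
            p->cache_pages--;
            p->free_pages++;
        }
        /* Grow the pool by one page. */
        p->free_pages--;
        p->free_items += p->items_per_page;
    }
    p->free_items--;
    return 0;
}
```

In this model a "nowait" caller fails as soon as free pages reach zero,
even though the caches still hold reclaimable pages, while a caller
that may wait succeeds.  If instrumenting the real allocator showed the
same pattern (zero or near-zero free pages at each failure), that would
support the theory.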

If this guesswork is right, I don't really have any idea how to go about
fixing it, though -- that would be something for the VM experts among us.

Regards,

- Håvard