Subject: Re: out of space in kmem_map on 2.0
To: None <port-sparc@NetBSD.org>
From: Steve Rikli <sr@genyosha.net>
List: port-sparc
Date: 01/07/2005 12:37:28
On Fri, Jan 07, 2005 at 07:16:53PM +0000, George Harvey wrote:
> Hi,
> 
> My SS20 died with 'panic: malloc: out of space in kmem_map' this
> afternoon while processing mail, it's running 2.0 MP with
> fetchmail/procmail/spamassassin. Checking the port-sparc mail archive, I
> see a similar problem on a SS2 in December and the suggestion then was
> to increase NKMEMPAGES. Does that solve the problem and if so, what is
> a good number to increase it by?
> 
> Possibly related: I checked the logs and before it died I was getting
> memory allocation errors from the scsi driver:
> 
>   /netbsd: sd0(esp0:0:3:0): unable to allocate scsipi_xfer
>   /netbsd: sdstart(): try again
> 
> My system is a SS20 with dual Ross HyperSparcs and 256MB RAM. Kernel is
> 2.0 MP release and NKMEMPAGES is currently 1536.

I'm pretty sure there have been multiple threads on this topic (or
related ones) in current-users, port-sparc, et al.:

  "Increasing default value of NKMEMPAGES_MAX" , Feb2004 port-sparc
  "Ungraceful low memory issue" , Aug2004 current-users
  ""unable to allocate scsipi_xfer" error messages" , Sep2004 current-users
  "panic: malloc: out of space in kmem_map" , Dec2004 port-sparc

and maybe even these older ones:

  "panic: malloc: out of space in kmem_map" from Oct2002
  "panic: malloc: out of space in kmem_map" from Mar2003
  "unable to allocate scsipi_xfer." from May2003

Some folks in those threads reported bumping NKMEMPAGES_MAX to e.g. 5120
helped avoid the problem.  The "Ungraceful low memory issue" thread was
actually happening on i386 (PR#25670 ?), I think, but it looked outwardly
similar to what some of us have been seeing on SPARC, FWIW.
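For reference, bumping it means adding an options line to the kernel
config and rebuilding in the usual way.  A sketch (MYKERNEL is a
placeholder name; 5120 is just the value folks reported trying):

```
# in /usr/src/sys/arch/sparc/conf/MYKERNEL   (placeholder config name)
options NKMEMPAGES=5120      # override the auto-computed kmem_map size

# then the standard rebuild:
#   cd /usr/src/sys/arch/sparc/conf && config MYKERNEL
#   cd ../compile/MYKERNEL && make depend && make
```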

I've got 2 SS20s similar to George's, but w/512MB each.  One runs 2.0 and
the other runs -current.  I see a mixture of the kmem_map panic and the
scsipi_xfer message on both, usually overnight, probably during the daily
cron run if I had to guess (I haven't watched it).

I recently took the 2.0 SS20 to a new kernel w/NKMEMPAGES set to 5120
just to see what happens, and so far it has survived 1 nightly cron run
while simultaneously building a new distribution.  Still going for now.

Interestingly, my 3rd SS20 runs 2.0 GENERIC.MP, has only 128MB and the
default vm.nkmempages = 1536, and has been up for over 2 weeks doing
the same bind/sendmail sort of duties as the others.  Its last boot
was planned, and I don't think it has ever hit either of the problems
mentioned.  I've been half thinking of pulling 256MB out of one of the
test SS20s just to see what happens.

Someone in one of the scsipi_xfer threads asked what 'pmap -p 0' looked
like.  While I haven't been logging it, my occasional inspections show
the 128MB SS20 doesn't seem to grow beyond about 80,000-90,000K in total,
while the 512MB systems go up to 170000K and maybe(?) higher.  The 128MB
system seems to shrink back down, I think, but I'm not sure I've ever
seen the 512MB systems decrease.  I'm not sure how to interpret what I
see there, being a sysadmin rather than a kernel/system programmer.
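If anyone wants numbers over time rather than occasional inspections,
something like the following crontab sketch might do (assumptions: the
total is on the last line of 'pmap -p 0' output, and the log path is
made up):

```
# crontab fragment -- append a timestamped kernel-map total every 10 min
# (/var/log/kmem-pmap.log is a placeholder path)
*/10 * * * * (date; /usr/bin/pmap -p 0 | tail -1) >> /var/log/kmem-pmap.log
```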

I'm planning to shuffle hardware around on the 2 512MB systems since I
got hold of faster HyperSPARC CPUs, so I have a bit of freedom to try
some things, if anyone has any suggestions.

cheers,
sr.