tech-kern archive


Re: Kernel memory allocation oddities in NetBSD-10.99.12



On Mon, 4 Aug 2025 10:41:22 -0700
Brian Buhrow <buhrow%nfbcal.org@localhost> wrote:

> 	hello.  I've been trying to track down a hard to reproduce
> lockup on one of my NetBSD/amd64 10.99.12 machines.  As part of my
> investigation, I've realized the issue is most likely related to some
> kernel memory allocation failure. This caused me to start monitoring
> the number of memory allocation failures, using the output of vmstat
> -m, piped through an awk script  which calls out any allocations that
> have experienced failures since the last reboot.
> 
> For example, on the machine, which has been up for 27 days, 22 hours, 
> I see: 
> 
> biopl 272 3639296 2 3639204 54830 54584 246 532 0 inf 236
> buf16k 16384 3380407 37 3355948 166739 157470 9269 13892 0 inf 0
> pcgnormal 256 25244948 153624 25244528 496390 496356 34 1095 0 inf 7
> pvpage 4096 1017753 8 1016052 474928 473227 1701 3701 0 inf 0
> xnfrx 4096 1538484 20 1538196 127677 127386 291 435 0 inf 3
> Totals 279028664 153691 275787771 3828646 3534742 293904
> 
> 
> 	The allocation category which fails the most is the
> pcgnormal, which is a fixed buffer of 256 bytes per allocation.  
> 
> 	One question I have is, why would pcgnormal fail so often,
> while, at the same time, 
> 
> execargs 262144 2436031 0 2436030 30522 30509 13 16 0 16 12
> 
> which allocates far larger block sizes far more often, experience no
> allocation failures?  
> 
> If execargs can always find 200,000 bytes to allocate, how is it that
> pcgnormal or xnfrx, which allocate 256 bytes and 4096 bytes
> respectively, cannot?
> 
> -thanks
> -Brian
> 

Hello Brian,

I haven't checked every affected pool/pool_cache, but the ones I did
look at that show "failed" allocations are allocated from with the
PR_NOWAIT flag set. With that flag, failing is expected whenever no
memory is available at that moment, and it's up to the caller to retry
or give up.
If I remember right, those pools also have a higher IPL_ level set (at
least the ones I checked) so that they can be used from interrupt or
soft interrupt context. Allocations in those contexts are not allowed
to sleep, which is why there is an assert_sleepable check in the
pool_get function to catch errors.
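
To illustrate the difference, here is a small userland toy model in
Python. It is not the kernel's pool code: the ToyPool class, its
counters, and the flag values are made up for this sketch. It only
shows why a non-sleeping request on an exhausted pool bumps a failure
counter and returns nothing, whereas a request that is allowed to sleep
would wait for a release instead.

```python
import threading

# Flag values are arbitrary for this model; only the names mirror the text.
PR_WAITOK = 0x01   # caller may sleep until an item becomes free
PR_NOWAIT = 0x02   # caller must not sleep; None means "nothing free right now"


class ToyPool:
    """Toy fixed-size pool (hypothetical, userland only)."""

    def __init__(self, nitems):
        self._free = nitems
        self._cv = threading.Condition()
        self.requests = 0
        self.failures = 0   # plays the role of a per-pool failure count

    def get(self, flags):
        with self._cv:
            self.requests += 1
            if flags & PR_NOWAIT:
                if self._free == 0:
                    # No sleeping allowed: record the failure and let
                    # the caller decide whether to retry or give up.
                    self.failures += 1
                    return None
            else:
                # Sleeping allowed: wait for a put() instead of failing.
                while self._free == 0:
                    self._cv.wait()
            self._free -= 1
            return object()

    def put(self, item):
        with self._cv:
            self._free += 1
            self._cv.notify()


pool = ToyPool(nitems=2)
a = pool.get(PR_NOWAIT)
b = pool.get(PR_NOWAIT)
c = pool.get(PR_NOWAIT)   # pool exhausted: returns None instead of sleeping
pool.put(a)
d = pool.get(PR_NOWAIT)   # succeeds again once an item has been released
print(pool.requests, pool.failures)
```

In this model a failure is counted only on the PR_NOWAIT path, so a
pool whose callers may sleep would never show a failure no matter how
large its items are.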

kind regards,
para 

-- 
You will continue to suffer
if you have an emotional reaction to everything that is said to you.
True power is sitting back and observing everything with logic.
If words control you that means everyone else can control you.
Breathe and allow things to pass.

--- Bruce Lee

