Re: netbsd-5 deadlocks when memory is low


> I have been dealing with a deadlock problem on netbsd-5, which I thought was
> related to PUFFS, but it seems it is another problem, hence the new thread.
> When running the perfused stress test (a build of NetBSD over a glusterfs
> volume), memory gets low, and the machine hangs. I can see ioflush
> is sleeping on km_getwait2 kernel memory allocation in PUFFS code:
> puffs_vnop_fsync/flushvncache/puffs_vnop_strategy/puffs_msgmem_alloc
> (the backtrace has more function calls, but this is to give an idea).
> Indeed memory is low, but there is a lot of free swap, as DDB tells us
> (see below). I understand pagedaemon should provide the new pages, but
> it keeps sleeping on pgdaemon.
> Code from HEAD works much better: instead of freezing, it kills big
> processes. It would be nice if we could reach that stage on netbsd-5:
> having a system freeze is bad. It does not even reboot on its own.
> Does anyone have suggestions on how to improve that? For instance, why
> does pagedaemon stay asleep while I have 1 free page left in memory
> and 98303 left in swap?

the pagedaemon stays asleep because there are already enough paging
requests in progress.  (see "paging=907" in your ddb "show uvm" output.)
it (reasonably) assumes that i/o will complete "soon" and free those pages.
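
to make the arithmetic concrete, here is a rough sketch of that decision.
it is a simplified model, not the actual uvm_pdaemon.c code: the struct
and function names are made up for illustration, and it assumes the check
amounts to comparing free + paging against free-target.

```c
#include <stdbool.h>

/* hypothetical stand-in for the few counters from "show uvm" used here */
struct uvm_counts {
	int free;	/* pages on the free list ("free") */
	int paging;	/* pages with pageout i/o in flight ("paging=") */
	int freetarg;	/* "free-target" */
};

/*
 * assumption: in-flight pageouts are counted as pages that will be
 * free shortly, so the daemon only starts more work when even those
 * would not bring the free list up to the target.
 */
bool
needs_free(const struct uvm_counts *c)
{
	return c->free + c->paging < c->freetarg;
}
```

with the numbers from the dump (free=1, paging=907, free-target=341),
1 + 907 = 908 is well above 341, so under this model the daemon sees no
reason to do more, even though only one page is actually free.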

i guess this kind of deadlock is less likely on some other OSes (e.g. linux)
because they flush dirty pages more eagerly than we do.
tracking and limiting the number of dirty pages in the system has been on
my todo list for a while, but i don't think i will have enough time for
this in the near future.  please feel free to take it. :-)
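
as a very rough illustration of what "tracking and limiting" could mean
(purely a sketch: none of these names exist in the kernel, and a real
version would need locking and a policy for choosing the limit):

```c
#include <stdbool.h>
#include <stddef.h>

/* hypothetical bookkeeping for system-wide dirty pages */
struct dirty_limit {
	size_t ndirty;	/* pages currently dirty */
	size_t max;	/* ceiling we want to stay under */
};

/* account for a page becoming dirty; true means "over the limit" */
bool
dirty_page_added(struct dirty_limit *dl)
{
	dl->ndirty++;
	return dl->ndirty > dl->max;
}

/* account for a page that has been written back and is clean again */
void
dirty_page_cleaned(struct dirty_limit *dl)
{
	if (dl->ndirty > 0)
		dl->ndirty--;
}
```

a caller that gets true back from dirty_page_added() would start
writeback (or wait for it) before dirtying more pages, so that flushing
begins before memory is already exhausted.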


> db> show uvm                                                    
> Current UVM status:
>   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>   63217 VM pages: 37237 active, 18514 inactive, 3075 wired, 1 free
>   pages  44073 anon, 11651 file, 2557 exec                        
>   freemin=256, free-target=341, wired-max=21072
>   faults=28130665, traps=28370005, intrs=5696666, ctxswitch=12864729
>   softint=10808869, syscalls=1277512298, swapins=30, swapouts=62    
>   fault counts:                                                 
>     noram=3, noanon=0, pgwait=0, pgrele=0
>     ok relocks(total)=3868(3868), anget(retrys)=6761147(1482), amapcopy=4569671
>     neighbor anon/obj pg=5435996/54403069, gets(lock/unlock)=13039755/2386
>     cases: anon=4936937, anoncow=1752606, obj=10898520, prcopy=2141234, przero=8357613
>   daemon and swap counts:
>     woke=55, revs=48, scans=55713, obscans=18396, anscans=33358
>     busy=0, freed=50847, reactivate=3106, deactivate=76784     
>     pageouts=4210, pending=29422, nswget=1482             
>     nswapdev=2, swpgavail=98303              
>     swpages=98303, swpginuse=33496, swpgonly=31129, paging=907
> -- 
> Emmanuel Dreyfus

