Some of you may remember my PR#s 38019 and 38246.  For a long time the problem seemed to be fixed, ever since I started running a kernel built from the "wrstuden-fixsa" branch off of "netbsd-4".

Recently, though, during some upgrades of my server I briefly ran a netbsd-4 kernel built from the head of the branch, i.e. one including the commit done for ticket #1196 and everything since.  Unfortunately the problems described in the PRs mentioned above reappeared immediately, and with a vengeance: even a few runs of "cvs update" on the NetBSD source tree could trigger them shortly after a fresh boot.  In fact it had been so long since I built the last wrstuden-fixsa kernel that I didn't remember doing so, and after seeing this problem again I almost believed the wrstuden-fixsa changes had never been merged to netbsd-4.  It seems the original wrstuden-fixsa changes were working very well, but that some regression may have crept in when they were merged to netbsd-4.

Perhaps related: I also briefly played a bit more with tuning to try to fix the problem, but only managed to make it worse when I increased vm.nkmempages from 32768 to 65536.  With that change the problem occurs after as little as one run of "cvs update".

So, I'm wondering if anything was possibly missed in the pullup done for ticket #1196.  I'm also wondering if anyone else has experienced strange lockups on large-memory servers with recent (since 2008/09/16) netbsd-4 kernels on i386.  Of particular interest are lockups where DDB shows processes sitting in "vmmapva".  Could these "vmmapva" lockups be related in any way to the fixsa changes?

I've done a diff of my relevant local source trees but I can't really see anything critical (my current netbsd-4 tree includes new local changes made since I built the working wrstuden-fixsa kernel, plus of course all pullups done to the netbsd-4 branch since wrstuden-fixsa was merged).  The only major differences I note are in some kernel tuning parameters, which I suppose may partly be responsible for exhausting more KVA space.

I'm going to try reducing BUFCACHE from the default 15% to 10% to see if that makes room for whatever else changed.  I'll also reduce NKMEMPAGES_MAX back down to 128 MB worth (I'm guessing that the kmem_map whose size it controls lives in KVA space and so squeezes things even more).  Interestingly, the older kernel I was running successfully has, I believe, BUFCACHE at the default 15%, since sysctl showed a vm.bufmem_hiwater value of very nearly 15% of RAM.

BTW, I did try setting 'options KERNBASE="0x80000000UL"', but that simply results in an instant reboot of my test machine right after the kernel is loaded, so somehow that doesn't work any more.

Is there any simple way I could write a kernel thread or similar to watch for KVA space exhaustion and at least report it before things grind too much to a halt?  (Kernel console output still works in this state, at least.)  A rough sketch of the sort of thing I have in mind follows below.

-- 
Greg A. Woods
Planix, Inc.

<woods%planix.com@localhost>  +1 416 218-0099  http://www.planix.com/
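P.S.  To make the above a bit more concrete: the tuning I mean would be config lines along the lines of "options BUFCACHE=10" and "options NKMEMPAGES_MAX=32768" (the latter assuming 4 KB pages for the 128 MB figure).  And the watcher thread I have in mind is sketched below.  This is a rough, completely untested outline only, written from memory of NetBSD-4 era interfaces (kthread_create1(9), tsleep(9), UVM's vm_map "size" field and the vm_map_min()/vm_map_max() macros); the kva_watch names, the 90% threshold, and the 10-second poll interval are arbitrary placeholders, the header locations are from memory, and the map fields are read without taking the map lock, so the numbers would only be approximate:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/kernel.h>		/* hz */
	#include <sys/proc.h>
	#include <sys/malloc.h>		/* kmem_map (assumed declared here) */
	#include <sys/kthread.h>
	#include <uvm/uvm_extern.h>
	#include <uvm/uvm_map.h>	/* vm_map_min()/vm_map_max() */

	/*
	 * Poll kmem_map every 10 seconds and complain on the console once
	 * usage goes above roughly 90% of the map's total VA range.  The
	 * map is read unlocked, so the report is only approximate.
	 */
	static void
	kva_watch_thread(void *arg)
	{
		struct vm_map *map = kmem_map;	/* kernel_map may be worth watching too */
		vsize_t total, used;

		for (;;) {
			total = vm_map_max(map) - vm_map_min(map);
			used = map->size;
			if (used > (total / 10) * 9)
				printf("kva_watch: kmem_map %lu of %lu bytes in use\n",
				    (unsigned long)used, (unsigned long)total);
			(void)tsleep(&map, PWAIT, "kvawatch", 10 * hz);
		}
	}

	/* To be called once, after the scheduler is running. */
	void
	kva_watch_init(void)
	{

		if (kthread_create1(kva_watch_thread, NULL, NULL, "kvawatch") != 0)
			printf("kva_watch: unable to create kernel thread\n");
	}

Something would of course have to call kva_watch_init() once the scheduler is up (hooking it into an existing late init or attach path would do for a quick experiment), and given the "vmmapva" wait channel it would probably be worth watching kernel_map as well as kmem_map.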