I am having a stability problem which is hard to figure out. I recently updated a machine from netbsd-5 to netbsd-6 (i386). It's a pretty normal box:

  NetBSD 6.0_STABLE (GENERIC) #22: Wed Jan 23 18:08:47 EST 2013
          gdt%ir.bbn.com@localhost:/u0/n1/obj/gdt-6/i386/sys/arch/i386/compile/GENERIC
  total memory = 3569 MB
  avail memory = 3497 MB
  cpu0 at mainbus0 apid 0: Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz, id 0x206a7
  cpu1 at mainbus0 apid 2: Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz, id 0x206a7
  cpu2 at mainbus0 apid 4: Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz, id 0x206a7
  cpu3 at mainbus0 apid 6: Intel(R) Core(TM) i5-2310 CPU @ 2.90GHz, id 0x206a7
  acpi0: X/RSDT: OemId <INTEL ,DH67CL ,01072009>, AslId <AMI ,00010013>

I have 16G of RAM, but haven't switched to amd64 mode yet.

Just before the upgrade, I had a problem where the machine would reboot just as amanda did estimates. I had just upgraded amanda to 3.3, and it was using snapshots to do estimates. Even though our dump supports estimate-only mode, amanda was killing dump after it printed the estimate (as it does for older dumps). My theory was that killing dump as it was tearing down a snapshot was bad. I rebuilt amanda w/o snapshot support (so it didn't give dump the snapshot flags), and then things were mostly ok.

Even so, I still had crashes and hangs. The main symptom is processes waiting on vmem. While in this state, the machine is up, but all processes and networking are hosed. ps from ddb shows many processes in vmem and tstile. Hitting return gets a login: prompt, but then ^T shows vmem. Looking at pools in ddb, I see failed requests for mbufs and clusters, but I think this is a symptom, not the cause. Sometimes swwdog causes a reboot. Sometimes it doesn't (perhaps because as long as the tickling process doesn't need memory it doesn't get blocked; I keep meaning to make it do something else, like read a random file, in between tickles).

I find that the machine usually crashes overnight. I suspected the daily cron jobs, so I took those out, and last night it stayed up. Running "find . -type f > FILES" in my homedir resulted in find stuck in vmem and not responsive to ^C. I was able to ssh in and do 'reboot'.

My kernel is pretty normal, but has IPSEC (kame) and coda. But I disabled coda and IPsec from running.

I have seen a similar problem with an earlier netbsd-6 snapshot in a private tree. In that case, there's a kernel thread running over a huge amount of (kernel) RAM, and per-packet processing also uses this huge amount of RAM. This mostly works, but the machine locks up on the daily cron job.

With a small fs (no sources, just the bare install), it seems ok. I do have kern.maxvnodes = 391680 to help with git and a repository with 269268 files. I wonder if this is what makes me odd.

Is anyone else seeing anything that could be like this?
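
For the watchdog idea mentioned above (doing some file I/O between tickles, so that a wedged system stops the tickles and the watchdog fires), here is a rough sketch in Python. It is only an illustration: the names read_random_file and tickle_loop are made up, and tickle is a hypothetical callback standing in for whatever the existing tickling process already does to reset the watchdog.

```python
import os
import random
import time


def read_random_file(root, max_bytes=4096, rng=random):
    """Read up to max_bytes from one randomly chosen regular file under root.

    Returns the path that was read, or None if root holds no regular files.
    """
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                candidates.append(path)
    if not candidates:
        return None
    path = rng.choice(candidates)
    with open(path, "rb") as f:
        f.read(max_bytes)
    return path


def tickle_loop(tickle, root, period, rounds):
    """Do some file I/O, then tickle, 'rounds' times.

    tickle is a hypothetical zero-argument callback that resets the
    watchdog. If the read blocks (for example because the process is
    stuck waiting for memory), tickle is never reached and the watchdog
    is left to expire.
    """
    for _ in range(rounds):
        read_random_file(root)
        tickle()
        time.sleep(period)
```

Pointing root at a small directory keeps the walk cheap; the aim is only to touch the filesystem on every round, not to generate load. A real version would loop forever rather than for a fixed number of rounds.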