Current-Users archive


Re: NetBSD-current on amd64 with Dell PERC 4e/Di hangs under load



Christos Zoulas <christos%zoulas.com@localhost> writes:

> Can you boot with a single processor?  Let's try to simplify the
> workload.

Should have thought of that myself.  It's running with SMP disabled now
(as a boot option; I haven't done anything to the BIOS configuration),
and this is very interesting.  I've got all my regular software running,
plus a full system build with "-j 4", to make sure it's kept really
busy, and it's showing hangs.

However: the hangs are short (5 to 10 seconds, typically, although I've
seen almost 20 a couple of times), and occur at varying intervals,
seemingly depending on how much disk access is going on: more often when
more is being written to disk.  Best of all, when it hangs, the system
seems totally unresponsive, neither answering ICMP ECHOs nor echoing
keypresses on the console, but it *is* accessing the disks!  The disk
lamps flicker, indicating that it's writing stuff, and then, presumably
when it's gone through the outstanding writes, the machine continues to
run other tasks.  Here's a typical snapshot from the ping(1) I've got
running in a window on my workstation:

64 bytes from 193.71.27.8: icmp_seq=1190 ttl=255 time=0.676274 ms
64 bytes from 193.71.27.8: icmp_seq=1191 ttl=255 time=0.684655 ms
64 bytes from 193.71.27.8: icmp_seq=1192 ttl=255 time=0.723203 ms
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
64 bytes from 193.71.27.8: icmp_seq=1204 ttl=255 time=0.553925 ms
64 bytes from 193.71.27.8: icmp_seq=1205 ttl=255 time=0.555601 ms
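
The round-trip times above fall by almost exactly one second per packet,
which is what one would expect if the probes, sent once per second, were
all answered at the same moment when the machine came back.  The longest
RTT in such a run is then a lower bound on the length of the hang.  As a
rough illustration only, here is a small Python sketch that picks such
runs out of a saved ping log.  The function name find_stalls and the
100 ms threshold are assumptions made for this example; they are not
part of ping(1) or of anything used in this thread.

```python
import re

# Matches the sequence number and round-trip time in a ping(1) reply line.
PING_RE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def find_stalls(lines, threshold_ms=100.0):
    """Return (first_seq, last_seq, max_rtt_ms) for each run of slow replies.

    threshold_ms is an arbitrary cut-off chosen for this sketch.
    """
    stalls = []
    run = []  # (seq, rtt) pairs of the slow run currently being collected
    for line in lines:
        m = PING_RE.search(line)
        if not m:
            continue  # skip anything that is not a reply line
        seq, rtt = int(m.group(1)), float(m.group(2))
        if rtt > threshold_ms:
            run.append((seq, rtt))
        elif run:
            # A fast reply ends the current run of slow ones.
            stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
            run = []
    if run:
        # The log ended while replies were still slow.
        stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
    return stalls


# A few lines taken from the snapshot above.
SAMPLE = """\
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
""".splitlines()

if __name__ == "__main__":
    for first, last, worst in find_stalls(SAMPLE):
        print(f"stall: seq {first}-{last}, at least {worst / 1000:.1f} s")
```

Run over a whole log, this would give a crude count of how often the
hangs occur and roughly how long each one lasts, which could then be
compared against the amount of disk writing going on at the time.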

Now, when I'm in SMP mode, the disk lights do *not* flicker while it
hangs, so we're dealing with a) something that causes the amr driver to
periodically take over completely, probably while it's flushing dirty
blocks to the disks, and b) something that causes this situation to lead
to much longer (and possibly even permanent) hangs when running on
multiple processors.

Cool!  :)

I'm going to look long and hard at /sys/dev/pci/amr.c again, and see if
I can figure out some good way to instrument it further -- but I hope
you will try to understand why it seems to be stopping everything else
while it's chugging through a bunch of outstanding disk operations --
and maybe even why this would get it into such big trouble with SMP.

-tih
-- 
Popularity is the hallmark of mediocrity.  --Niles Crane, "Frasier"

