Current-Users archive


Re: NetBSD-current on amd64 with Dell PERC 4e/Di hangs under load



On Jan 28,  7:37pm, tih%hamartun.priv.no@localhost (Tom Ivar Helbekkmo) wrote:
-- Subject: Re: NetBSD-current on amd64 with Dell PERC 4e/Di hangs under load

| Christos Zoulas <christos%zoulas.com@localhost> writes:
| 
| > Can you boot with a single processor?  Let's try to simplify the
| > workload.
| 
| Should have thought of that myself.  It's running with SMP disabled now
| (as a boot option; I haven't done anything to the BIOS configuration),
| and this is very interesting.  I've got all my regular software running,
| plus a full system build with "-j 4", to make sure it's kept really
| busy, and it's showing up hangs.
| 
| However: the hangs are short (5 to 10 seconds, typically, although I've
| seen almost 20 a couple of times), and occur at varying intervals,
| seemingly depending on how much disk access is going on: more often when
| more is being written to disk.  Best of all, when it hangs, the system
| seems totally unresponsive, neither answering ICMP ECHOs nor echoing
| keypresses on the console, but it *is* accessing the disks!  The disk
| lamps flicker, indicating that it's writing stuff, and then, presumably
| when it's gone through the outstanding writes, the machine continues to
| run other tasks.  Here's a typical snapshot from the ping(1) I've got
| running in a window on my workstation:
| 
| 64 bytes from 193.71.27.8: icmp_seq=1190 ttl=255 time=0.676274 ms
| 64 bytes from 193.71.27.8: icmp_seq=1191 ttl=255 time=0.684655 ms
| 64 bytes from 193.71.27.8: icmp_seq=1192 ttl=255 time=0.723203 ms
| 64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
| 64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
| 64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
| 64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
| 64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
| 64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
| 64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
| 64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
| 64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
| 64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
| 64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
| 64 bytes from 193.71.27.8: icmp_seq=1204 ttl=255 time=0.553925 ms
| 64 bytes from 193.71.27.8: icmp_seq=1205 ttl=255 time=0.555601 ms
| 
| Now, when I'm in SMP mode, the disk lights do *not* flicker while it
| hangs, so we're dealing with a) something that causes the amr driver to
| periodically take over completely, probably while it's flushing dirty
| blocks to the disks, and b) something that causes this situation to lead
| to much longer (and possibly even permanent) hangs when running on
| multiple processors.
| 
| Cool!  :)
| 
| I'm going to look long and hard at /sys/dev/pci/amr.c again, and see if
| I can figure out some good way to instrument it further -- but I hope
| you will try to understand why it seems to be stopping everything else
| while it's chugging through a bunch of outstanding disk operations --
| and maybe even why this would get it into such big trouble with SMP.
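
An aside on the ping trace quoted above: the round-trip times can be used to estimate how long each hang lasts. The following is only a rough sketch, not something from the original exchange; the names parse_ping and find_stalls and the 100 ms threshold are assumptions made for illustration.

```python
import re

# Matches the relevant fields of a ping(1) reply line, e.g.
# "64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms"
PING_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")

# Excerpt of the trace quoted in this message.
SAMPLE = """\
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
"""


def parse_ping(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in PING_RE.finditer(text)]


def find_stalls(samples, threshold_ms=100.0):
    """Group consecutive replies whose RTT exceeds threshold_ms.

    Returns a list of (first_seq, last_seq, longest_rtt_ms). The longest
    RTT in a run approximates how long the host was unresponsive.
    The threshold is an arbitrary assumption; tune it for the network.
    """
    stalls, run = [], []
    for seq, rtt in samples:
        if rtt > threshold_ms:
            run.append((seq, rtt))
        elif run:
            stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
            run = []
    if run:  # trace ended while still inside a stall
        stalls.append((run[0][0], run[-1][0], max(r for _, r in run)))
    return stalls


if __name__ == "__main__":
    for first, last, longest in find_stalls(parse_ping(SAMPLE)):
        print(f"icmp_seq {first}-{last}: about {longest / 1000:.1f} s")
```

In the excerpt the times fall by roughly 1000 ms per packet, which matches ping's default one-second send interval: the nine delayed replies all seem to have gone out at nearly the same moment, once the machine started responding again after about 8.3 seconds.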

Excellent! This sounds like a very interesting problem... I am being
pulled in every which direction right now, so I don't have much time
to look into it, but I'll try to do so over the weekend (look at amr.c).

Good luck!

christos

