NetBSD-Bugs archive
kern/44376: wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
>Number: 44376
>Category: kern
>Synopsis: wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jan 12 17:00:00 +0000 2011
>Originator: Dr. Wolfgang Stukenbrock
>Release: NetBSD 5.1, NetBSD-Head (12.01.2011)
>Organization:
Dr. Nagler & Company GmbH
>Environment:
System: NetBSD s0g7 5.99.43 NetBSD 5.99.43 (GENERIC) #0: Tue Jan 11 15:20:29 UTC 2011
builds%b7.netbsd.org@localhost:/home/builds/ab/HEAD/amd64/201101112100Z-obj/home/builds/ab/HEAD/src/sys/arch/amd64/compile/GENERIC
amd64
Architecture: x86_64
Machine: amd64
>Description:
We've tried to set up 5.1 on a new system with a Supermicro X8SIL board
and a Xeon L3406 CPU.
This board uses the 3400 chipset with two 82574L (rev 0) on board.
There are two makphy (88E1149 ver 1) present.
I've tried the original 5.1, a 5.1 with an updated wm driver (plus some
other phy changes required for it, including some header file
extensions) and finally an "original" GENERIC kernel from HEAD, taken
from the NetBSD ftp server. (The system info above is from that kernel.)
All of these setups lead to a kernel deadlock, with no further
interrupt processing, after a short time if the onboard
82574 interfaces are used.
I've added a dual-port PCIe card (Intel PRO/1000 PT (82571EB)) in order
to get further information about the cause.
That one runs much more stably in all of the setups above. I've had
only one, not reproducible, "crash" with it - see below.
The scenario:
The system is set up as a gateway that routes between two other
networks.
Only "the normal" unix stuff is running on it. (named, syslogd,
rpcbind, ypserv, ypbind, amd, ntpd, sshd, postfix, cron)
I don't believe it is related to one of them. If desired, I can stop
some of them ...
Most of the time the system simply freezes.
In some cases, the system reports problems when accessing the phy.
(This has happened once for the PCIe card too.)
When that happens, the system continues working with the PCIe card,
but with the onboard interfaces it freezes in most cases a short time
afterwards.
I've seen this already during system boot ... and at that point I
hadn't started any ftp test transfer through the system.
When WM_DEBUG with LINK_DEBUG is enabled, only the "normal" switch to
HDX is reported - nothing else ...
When RX and TX are enabled too, there is too much output and I've seen
some crashes with bad kernel access - there
seems to be a timing problem somewhere here ...
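The following is a minimal userland sketch of mask-gated debug output,
assuming the driver selects its debug messages by a per-class bit mask.
The names DBG_LINK, DBG_TX, DBG_RX, dbg_mask and emit_one_of_each are
hypothetical and are not taken from the wm driver. It only illustrates
why enabling the RX and TX classes multiplies the amount of console
output compared to the link class alone.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical debug classes; the real names in the wm driver may differ. */
#define DBG_LINK 0x01
#define DBG_TX   0x02
#define DBG_RX   0x04

int dbg_mask = DBG_LINK;        /* only link events enabled by default */
unsigned long dbg_emitted;      /* number of messages actually printed */

/* Print a message only if its class is enabled in dbg_mask. */
#define DPRINTF(cls, ...)                       \
	do {                                    \
		if (dbg_mask & (cls)) {         \
			dbg_emitted++;          \
			printf(__VA_ARGS__);    \
		}                               \
	} while (0)

/* Emit one message of each class; returns how many were printed. */
unsigned long
emit_one_of_each(void)
{
	unsigned long before = dbg_emitted;

	DPRINTF(DBG_LINK, "link: switched to HDX\n");
	DPRINTF(DBG_TX, "tx: packet queued\n");
	DPRINTF(DBG_RX, "rx: packet received\n");
	return dbg_emitted - before;
}
```

With the default mask only the link message is printed. After setting
dbg_mask to DBG_LINK | DBG_TX | DBG_RX, every packet in either direction
produces a printf as well, which on a slow console can change the
timing of the code being observed.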
I've put some debug output into the 5.1 kernel that tracks the
kernel-lock, but I haven't found anything that
points to a problem.
Nevertheless, the famous last words in this output are always a little
bit strange. Some of the CPUs (not always all of them; there are 4:
2 cores with 2 threads each) try to take the kernel lock and stay
there.
(Remark: I've tried with hyperthreading of the second core disabled -
no change. I haven't tested single-processor so far.)
The output preceding that does not always report that a CPU is inside
the lock, and my output that prints the number of the CPU holding the
lock while waiting says that no CPU is in there, even when the output
prior to the system freeze reports that a CPU has entered the lock and
is still in there.
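As an illustration of the kind of holder tracking described above, here
is a minimal userland sketch using C11 atomics. It is not the NetBSD
kernel-lock implementation and not the debug code used in this report;
struct lock_trace and all function names are hypothetical. The sketch
keeps the lock state and the holder's CPU number in the same atomic
word, so a waiter can never observe "locked" together with "no holder".

```c
#include <assert.h>
#include <stdatomic.h>

#define NO_HOLDER (-1)

/* Hypothetical tracker; not the NetBSD kernel_lock implementation. */
struct lock_trace {
	atomic_int holder;      /* CPU index holding the lock, or NO_HOLDER */
};

void
lock_trace_init(struct lock_trace *lt)
{
	atomic_init(&lt->holder, NO_HOLDER);
}

/* Try to take the lock for 'cpu'. Returns 1 on success, 0 if it is held. */
int
lock_trace_try_enter(struct lock_trace *lt, int cpu)
{
	int expected = NO_HOLDER;

	return atomic_compare_exchange_strong_explicit(&lt->holder, &expected,
	    cpu, memory_order_acquire, memory_order_relaxed);
}

/* Release the lock; only the recorded holder may do so. */
void
lock_trace_leave(struct lock_trace *lt, int cpu)
{
	assert(atomic_load_explicit(&lt->holder, memory_order_relaxed) == cpu);
	atomic_store_explicit(&lt->holder, NO_HOLDER, memory_order_release);
}

/* What a waiting CPU would print: the current holder, or NO_HOLDER. */
int
lock_trace_holder(struct lock_trace *lt)
{
	return atomic_load_explicit(&lt->holder, memory_order_acquire);
}
```

If the holder is instead recorded in a separate variable that is
written after the lock has been taken, there is a short window in which
a waiter sees the lock held but no holder recorded. Whether that
accounts for the output described above is not known.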
The enter/leave output in wm_intr() says that no wm interrupt is
active at the time of the freeze.
About half a minute before the system freezes, the
wm interrupts suddenly stop arriving regularly.
(The traffic load is still the same.)
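One way to make "interrupts no longer arrive regularly" measurable
without printing on every interrupt is to keep a few counters and
compare the gap between consecutive interrupts against a threshold.
This is only a sketch under assumptions: struct intr_gap and
intr_gap_record are hypothetical names, and the caller is assumed to
supply a monotonic timestamp in nanoseconds from whatever clock source
is available.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-interface statistics on interrupt spacing. */
struct intr_gap {
	int seen;               /* nonzero once the first interrupt arrived */
	uint64_t last_ns;       /* timestamp of the previous interrupt */
	uint64_t max_gap_ns;    /* largest gap observed so far */
	unsigned long late;     /* number of gaps above the threshold */
};

/*
 * Record one interrupt at time now_ns.
 * Returns 1 if the gap since the previous interrupt exceeded
 * threshold_ns, 0 otherwise (including for the very first interrupt).
 */
int
intr_gap_record(struct intr_gap *g, uint64_t now_ns, uint64_t threshold_ns)
{
	int too_long = 0;

	if (g->seen) {
		uint64_t gap = now_ns - g->last_ns;

		if (gap > g->max_gap_ns)
			g->max_gap_ns = gap;
		if (gap > threshold_ns) {
			g->late++;
			too_long = 1;
		}
	}
	g->seen = 1;
	g->last_ns = now_ns;
	return too_long;
}
```

The counters can be printed from a periodic, non-interrupt context, so
the output volume stays constant regardless of the interrupt rate.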
I've no further ideas on how to go on with debugging -
but I will continue if I get some new hints.
(We need that system in production soon, so I will lose the
system for testing during this month ...)
The output of my kernel-lock debugging looks as if memory access and
cache synchronization get out of sync.
But I cannot believe this, because in that case the problem should also
happen under heavy load on the system without
network traffic, and I haven't seen this up to now.
I've no idea where to place debugging code to find the point where
the system stops processing interrupts.
I've failed to add short printouts to the interrupt stub routines, but
that is of course my own problem ...
At least wm_intr() is not called at that time - as long as printf() on
the serial console is working correctly.
(Remark: when running on the graphics console there seems to be much
more kernel-lock "activity" than on the serial console.
And the famous last words on the graphics console are truncated in the
middle of a printf of 3 chars - on the serial console
I've always seen the last printf completely ...)
Due to the fact that absolutely no interrupts are processed anymore, I
cannot get into DDB ...
Help ....
>How-To-Repeat:
Boot a current GENERIC kernel and generate some network traffic on the
HW setup above.
The system will deadlock soon - no interrupts are processed anymore.
>Fix:
Unfortunately not known so far ...
Not even an idea how to go on with debugging ...
>Unformatted: