kern/44376: wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/44376: wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
From: Wolfgang.Stukenbrock%nagler-company.com@localhost
Date: Wed, 12 Jan 2011 17:00:00 +0000 (UTC)

>Number:         44376
>Category:       kern
>Synopsis:       wm interface 82574 on Supermicro X8SIL (with Xeon L3406) -> kernel deadlock
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jan 12 17:00:00 +0000 2011
>Originator:     Dr. Wolfgang Stukenbrock
>Release:        NetBSD 5.1, NetBSD-Head (12.01.2011)
>Organization:
Dr. Nagler & Company GmbH
>Environment:
        
        
System: NetBSD s0g7 5.99.43 NetBSD 5.99.43 (GENERIC) #0: Tue Jan 11 15:20:29 UTC 2011
builds%b7.netbsd.org@localhost:/home/builds/ab/HEAD/amd64/201101112100Z-obj/home/builds/ab/HEAD/src/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
        We've tried to set up 5.1 on a new system with a Supermicro X8SIL
        board and a Xeon L3406 CPU.
        This board uses the 3400 chipset with two 82574L (rev 0) on board.
        There are two makphy (88E1149 ver 1) present.

        I've tried the original 5.1, a 5.1 with an updated wm driver (and
        some other phy stuff required for this, including some other header
        file extensions) and finally an "original" GENERIC kernel from HEAD
        from the NetBSD ftp server. (The system info above is from that
        kernel.)
        All of these setups lead to a kernel deadlock, without any further
        interrupt processing, after a short time if the onboard 82574
        interfaces are used.
        I've added a dual-port PCIe card (Intel PRO/1000 PT (82571EB)) in
        order to get further information about the cause.
        That one runs much more stably in all of the setups above. I've seen
        only one, not reproducible, "crash" with it - see below.

        The scenario:
        The system is set up as a gateway that routes between two other
        networks.
        Only "the normal" unix stuff is running on it. (named, syslogd,
        rpcbind, ypserv, ypbind, amd, ntpd, sshd, postfix, cron)
        I don't believe it is related to one of them. If desired, I can stop
        some of them ...

        Most of the time the system simply freezes.
        In some cases the system reports problems when accessing the phy.
        (This has happened once for the PCIe card too.)
        When that happened, the system continued working with the PCIe card,
        but with the onboard interfaces it freezes in most cases a short
        time afterwards.
        I've seen this already during system boot ... and at that point I
        hadn't started any ftp test transfer through the system.

        When WM_DEBUG with LINK_DEBUG is enabled, only the "normal" switch to
        HDX is reported - nothing else ...
        When RX and TX are enabled too, there is too much output and I've
        seen some crashes with bad kernel access - there seems to be a timing
        problem somewhere here ...

        I've put some debug output into the 5.1 kernel that tracks the
        kernel lock, but I haven't found anything that points to a problem.
        Nevertheless, the famous last words in this output are always a
        little bit strange. Some of the CPUs (there are 4: 2 cores with 2
        threads each), but not always all of them, are going to lock the
        kernel lock and stay there.
        (remark: I've tried disabling hyperthreading of the second core - no
        change. I haven't tested single-processor operation so far.)
        The output just before that does not always report that a CPU is
        inside the lock, and my output that prints the number of the CPU
        holding the lock while waiting says that no CPU is in there, even if
        the output prior to the system freeze reports that a CPU has entered
        the lock and is still in there.
        The output on enter/leave in wm_intr() says that no wm interrupt is
        active at the time of the freeze.
        About half a minute before the system freezes, the wm interrupts
        suddenly no longer arrive regularly.
        (the traffic load is still the same)
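
        For illustration only: the following is a minimal user-space sketch
        in C11 of the kind of owner tracking described above. It is not the
        debug code that was actually added to the 5.1 kernel; the names
        dbg_lock, dbg_lock_try and dbg_lock_release are hypothetical. It
        assumes that the owner is recorded with a separate store after the
        lock word has been taken. Under that assumption a waiter can briefly
        observe a held lock with no recorded owner; whether this has anything
        to do with the output seen here is not known.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical debug wrapper: a spin lock plus the index of the holding CPU. */
struct dbg_lock {
    atomic_int locked;  /* 0 = free, 1 = held */
    atomic_int owner;   /* CPU index of the holder, or NO_OWNER */
};

static void dbg_lock_init(struct dbg_lock *l)
{
    atomic_init(&l->locked, 0);
    atomic_init(&l->owner, NO_OWNER);
}

/*
 * One acquisition attempt. Returns true if the lock was taken.
 * On failure, *seen_owner receives the owner value observed by the waiter.
 */
static bool dbg_lock_try(struct dbg_lock *l, int cpu, int *seen_owner)
{
    if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0) {
        /*
         * Window: the lock is already held here, but the owner field still
         * says NO_OWNER until the next store becomes visible to a waiter.
         */
        atomic_store_explicit(&l->owner, cpu, memory_order_relaxed);
        return true;
    }
    *seen_owner = atomic_load_explicit(&l->owner, memory_order_relaxed);
    return false;
}

static void dbg_lock_release(struct dbg_lock *l, int cpu)
{
    /* Only the recorded owner may release the lock. */
    assert(atomic_load_explicit(&l->owner, memory_order_relaxed) == cpu);
    atomic_store_explicit(&l->owner, NO_OWNER, memory_order_relaxed);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

        A waiter would call dbg_lock_try() in a loop and print *seen_owner
        while it spins. With this layout, a printed NO_OWNER only means that
        the owner store had not been observed yet, not that the lock is free.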


        I've no more ideas how to go on with debugging - but I will continue
        if I get some new hints.
        (We need that system in a production setup soon, so I will lose the
        system for testing during this month ...)


        The output of my kernel-lock debugging looks as if something in
        memory access and cache synchronization gets out of sync.
        But I cannot believe this, because in that case the problem should
        also happen under heavy load on the system without network traffic,
        and I haven't seen this up to now.

        I've no idea where to place debugging code in order to find the point
        where the system stops processing interrupts.
        I've failed to add short printouts to the interrupt stub routines,
        but that is of course my own personal problem ...
        At least wm_intr() is not called at that time - as long as printf()
        on the serial console is working correctly.
        (remark: when running on the graphics console there seems to be much
                 more kernel-lock "activity" than on the serial console.
                 And the famous last words on the graphics console are
                 truncated in the middle of a printf of 3 chars - on the
                 serial console I've always seen the last printf
                 completely ...)
        Due to the fact that absolutely no interrupts are processed anymore,
        I cannot get into DDB ...
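
        A possible alternative to printf() in the interrupt path, sketched
        here as plain C that can be tried in user space: record fixed-size
        events in a small ring and print them only when the gap between two
        events gets unusually large, for example when the wm interrupts stop
        arriving regularly. All names (trace_ring, trace_record,
        trace_snapshot) are hypothetical, and the sketch assumes one ring per
        CPU so that no locking is needed. It has not been tried in this
        kernel.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 8u  /* must be a power of two */
static_assert((TRACE_SLOTS & (TRACE_SLOTS - 1u)) == 0,
    "TRACE_SLOTS must be a power of two");

struct trace_event {
    uint64_t ts;     /* timestamp in arbitrary ticks, supplied by the caller */
    uint16_t cpu;    /* CPU index */
    uint16_t event;  /* caller-defined event code, e.g. enter/leave */
};

struct trace_ring {
    uint32_t head;     /* total number of events recorded so far */
    uint64_t last_ts;  /* timestamp of the most recent event */
    struct trace_event slot[TRACE_SLOTS];
};

/*
 * Record one event, overwriting the oldest slot when the ring is full.
 * Returns the gap in ticks since the previous event (0 for the first one),
 * so the caller can decide to dump the ring when the gap is too large.
 */
static uint64_t trace_record(struct trace_ring *r, uint64_t ts,
    uint16_t cpu, uint16_t event)
{
    uint64_t gap = (r->head != 0) ? ts - r->last_ts : 0;
    struct trace_event *e = &r->slot[r->head & (TRACE_SLOTS - 1u)];

    e->ts = ts;
    e->cpu = cpu;
    e->event = event;
    r->head++;
    r->last_ts = ts;
    return gap;
}

/*
 * Copy up to max of the most recent events into out, oldest first.
 * Returns the number of events copied.
 */
static size_t trace_snapshot(const struct trace_ring *r,
    struct trace_event *out, size_t max)
{
    size_t n = (r->head < TRACE_SLOTS) ? r->head : TRACE_SLOTS;
    uint32_t start;

    if (n > max)
        n = max;
    start = r->head - (uint32_t)n;
    for (size_t i = 0; i < n; i++)
        out[i] = r->slot[(start + (uint32_t)i) & (TRACE_SLOTS - 1u)];
    return n;
}
```

        Recording an event is only a few stores, so it should disturb the
        timing much less than a printf() on the console. The ring can only be
        printed while the system is still alive, which is why the returned
        gap is meant to be used as the trigger.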

        Help ....
>How-To-Repeat:
        Boot a current GENERIC kernel and generate some network traffic on
        the HW setup above.
        The system will deadlock soon - no interrupts are processed anymore.
>Fix:
        Unfortunately not known so far ...
        Not even an idea how to go on with debugging ...

>Unformatted:
