Subject: Re: wm0 lossage in SGI O2
To: Rafal Boni <rafal@attbi.com>
From: Andrey Petrov <petrov@netbsd.org>
List: tech-net
Date: 03/17/2003 12:10:40
On Sun, Mar 16, 2003 at 06:07:17PM -0500, Rafal Boni wrote:
> Folks:
> 	I'm having odd troubles with my Intel GigE PCI-X card in the PCI
> 	slot of my O2.  Most of the time, it works fairly well, but once
> 	in a while, it gets wedged in an odd way... I suspect some
> 	cache-related foo on the O2, but thought I'd solicit other ideas too.
> 

That fits a cache-coherency problem very well. I had exactly the same
symptoms on sparc64/hme when a receive descriptor was not flushed out of
the iommu streaming cache. The interrupt handler didn't see the received
packet because the descriptor was not synced, so the driver and the card
got out of sync on the 'next receive descriptor' and everything got stuck
at that point. You can make the wm driver more resilient by working around
an out-of-sync 'next receive descriptor', but you'd be better off finding
out why it is not flushed to memory.
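
To illustrate the failure mode, here is a rough toy model in C, not code
from the wm or hme drivers. All names (rx_ring, ring_sync, RXD_DONE,
NRXDESC) are made up for the example. The 'dma' array stands for what the
card wrote to memory, the 'cpu' array for the CPU's possibly stale view,
and ring_sync() plays the role that a bus_dmamap_sync() call with
BUS_DMASYNC_POSTREAD would play in a real driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NRXDESC  64     /* hypothetical ring size */
#define RXD_DONE 0x01   /* hypothetical "descriptor done" status bit */

struct rx_desc {
	unsigned char  status;
	unsigned short len;
};

/*
 * Toy model: 'dma' is what the device wrote to memory, 'cpu' is the
 * CPU's view, which can be stale until the descriptor is synced.
 */
struct rx_ring {
	struct rx_desc dma[NRXDESC];
	struct rx_desc cpu[NRXDESC];
	int rxptr;          /* next descriptor the driver expects */
};

/* Stand-in for a POSTREAD sync of one descriptor (assumption). */
static void
ring_sync(struct rx_ring *r, int idx)
{
	r->cpu[idx] = r->dma[idx];
}

/* Device side: complete a packet in descriptor 'idx'. */
static void
device_receive(struct rx_ring *r, int idx, unsigned short len)
{
	r->dma[idx].len = len;
	r->dma[idx].status |= RXD_DONE;
}

/* RX interrupt handler; returns the number of packets consumed. */
static int
rx_intr(struct rx_ring *r, bool do_sync)
{
	int n = 0;

	for (;;) {
		int i = r->rxptr;

		if (do_sync)
			ring_sync(r, i);
		if ((r->cpu[i].status & RXD_DONE) == 0)
			break;      /* nothing (visible) at 'next' descriptor */

		/* Hand the descriptor back to the device. */
		r->cpu[i].status = 0;
		r->dma[i].status = 0;
		r->rxptr = (i + 1) % NRXDESC;
		n++;
	}
	return n;
}
```

In this model, if the card completes descriptor 63 and rx_intr() runs with
do_sync false, it consumes nothing and rxptr stays at 63 on every
interrupt, which looks like the wedge described. With do_sync true the
same call consumes the packet and rxptr wraps to 0.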

	Andrey


> 	When it hangs, it's still generating RX interrupts (I checked this
> 	by compiling with WM_DEBUG defined, wm_debug initially 0 and then
> 	setting wm_debug via DDB to print ~ all events of stuff when I get
> 	into the hung state), and pinging the machine from another system
> 	generates ~ the right number of RX interrupts.
> 
> 	However, the system is wedged checking a specific descriptor and
> 	does not make it past there:
> 
> wm0: RX: checking descriptor 63
> wm0: RX: rxptr -> 63
> wm0: TX: txsdirty -> 44
> [...repeats, with one appearing when new packets show up...]
> 
> 	Interestingly, in this state, the TX side of the card still works, 
> 	but unless I ifconfig down/up the card, receive stays in the hung
> 	state.
> 
> 	This seems somewhat specific to the Intel GigE card, as the cheapo
> 	realtek I was using before worked fine, but I doubt it's the
> 	card itself, as it works well most of the time...
> 
> 	The NIC is connected to a 100/FDX switch, media is set to `auto'; 
> 	the NIC/PHY identify themselves as:
> 
> wm0 at pci0 dev 3 function 0: Intel i82544EI 1000BASE-T Ethernet, rev. 2
> wm0: interrupting at crime interrupt 10
> wm0: Ethernet address 00:02:b3:xx:xx:xx
> makphy0 at wm0 phy 1: Marvell 88E1000 Gigabit PHY, rev. 0
> makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT, 1000baseT-FDX, auto
> 
> Any thoughts?
> --rafal
> 
> ----
> Rafal Boni                                                     rafal@attbi.com
>   We are all worms.  But I do believe I am a glowworm.  -- Winston Churchill
>