Subject: Re: BUG IN IF_ED DRIVER PERSISTS UNTIL TODAY.
To: Charles M. Hannum <mycroft@mit.edu>
From: Brian Buhrow <buhrow@cats.ucsc.edu>
List: tech-kern
Date: 08/30/1996 13:18:41
	It might be true that my quotations of the code are not as good as they
could be, but my description of what I'm seeing stands.  That is, during
particularly busy traffic flows with this card (I have a couple of sample
cards that fail in the same way), I see the message:

Aug 25 16:50:55 baloo /netbsd: ed2: remote transmit DMA failed to complete

At that point, if the packet in question happens to be an outgoing packet
for a telnet session, or any other TCP session, that TCP flow is hosed in
one direction only, because baloo, the NetBSD host in question, is sending
out garbage retransmits.  I have a trace, taken with a Network General
Sniffer, showing this very predictable behavior.

I admit that I don't entirely understand how the NetBSD if_ed driver could
be mangling the packet so badly, but something certainly is, and it happens
only when the driver seems to cough.
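
For what it's worth, here is a minimal sketch of the behavior I would
expect when the remote DMA fails: the copy routine reports the failure and
the output routine does not start a transmission.  All of the names here
(fake_card, card_copy_out, card_start_output) are made up for illustration,
and the card is simulated in ordinary memory; this is not the actual if_ed
code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CARD_MEM_SIZE 1536	/* hypothetical size of one transmit buffer */

/* Hypothetical stand-in for the card; not the real if_ed softc. */
struct fake_card {
	unsigned char mem[CARD_MEM_SIZE];	/* on-card transmit buffer */
	int dma_will_complete;	/* test knob: does the remote DMA finish? */
	int resets;		/* times the card was reset */
	int packets_sent;	/* times a transmission was started */
	int output_errors;	/* failed transmit attempts */
};

/*
 * Copy a packet into card memory.  Returns 0 on success, or -1 if the
 * (simulated) remote DMA never completes.  On failure the card is reset
 * and its buffer contents must be treated as garbage.
 */
int
card_copy_out(struct fake_card *c, const unsigned char *pkt, size_t len)
{
	assert(len <= CARD_MEM_SIZE);
	if (!c->dma_will_complete) {
		memset(c->mem, 0xff, sizeof(c->mem));	/* simulate garbage */
		c->resets++;
		return -1;
	}
	memcpy(c->mem, pkt, len);
	return 0;
}

/*
 * Start output of one packet.  A transmission is started only if the copy
 * succeeded; otherwise the error is counted and reported to the caller.
 */
int
card_start_output(struct fake_card *c, const unsigned char *pkt, size_t len)
{
	if (card_copy_out(c, pkt, len) != 0) {
		c->output_errors++;
		return -1;	/* nothing goes out on the wire */
	}
	c->packets_sent++;
	return 0;
}
```

With dma_will_complete set to 0, card_start_output() returns -1 and
packets_sent stays at 0, so whatever is left in the card's buffer never
reaches the wire.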
-Brian

On Aug 30,  3:36pm, Charles M. Hannum wrote:
} Subject: Re: BUG IN IF_ED DRIVER PERSISTS UNTIL TODAY.
} 
} buhrow@cats.ucsc.edu (Brian Buhrow) writes:
} 
} > 
} > 	The problem is in the handling of a hardware error condition.  If the
} > card resets during a particularly busy set of traffic flows, the ring
} > buffer pointers can get balloxed up, causing data corruption in the
} > outgoing packet.  While TCP will detect that the packet didn't make it to
} > its destination, it will wrongly resend the generated packet, which is now
} > garbage, thanks to the fine chip makers at National Semiconductor.  That
} > packet won't get through because it doesn't pass the IP checksum, which,
} > of course, it shouldn't.
} > 	The problem in the driver is that if it resets the card, due to the chip's
} > failure, it doesn't return a different status to the sending output
} > routine.  Here's the relevant section of the driver.
} >
} > [...]
} 
} The first piece of code you quote has to do with packet reception.
} This code `can't fail', unless we are out of mbufs, in which case the
} packet is silently dropped.  I don't see a problem here.
} 
} The second piece of code has to do with packet transmission.  This
} code doesn't modify the mbufs in kernel memory, so there's no way it
} could corrupt the retransmissions.  So far, I still don't see a
} problem.
} 
} The one place I do see a (minor) problem is that, even if the transmit
} DMA fails to complete, we send whatever happens to be in the device's
} memory as a packet (*once*).  This will almost certainly result in the
} packet being dropped wherever it arrives.  So far, I *still* don't
} see a problem.  In addition, unless you're actually seeing `remote
} transmit DMA failed to complete' messages, this isn't relevant at all.
} (This should be fixed anyway, of course.)
} 
} I believe your analysis is incorrect.
} 
>-- End of excerpt from Charles M. Hannum