Subject: BUG IN IF_ED DRIVER PERSISTS UNTIL TODAY.
To: None <tech-kern@NetBSD.ORG>
From: Brian Buhrow <buhrow@cats.ucsc.edu>
List: tech-kern
Date: 08/19/1996 17:30:48
Hello folks. A while back I wrote about a bug in the if_ed driver
shipped with NetBSD 0.9A which, if your NE2000 is vulnerable,
causes your TCP sessions to be randomly hosed. I was told by some people
on this list to shut up about such old bugs and that if I wanted help with
things, I'd better run a modern version. While I'm still running the
archaic version, because it just works and I haven't the time to upgrade it
and the applications I have to support on it, I have looked at
the modern version of the if_ed driver, and believe the bug is still
present and waiting to byte someone in their telnet session.
The problem is in the handling of a hardware error condition. If the
card resets during a particularly busy set of traffic flows, the ring
buffer pointers can get bollixed up, corrupting the data in the
outgoing packet. TCP will detect that the packet didn't make it to
its destination, but it will then resend the same packet, which is now
garbage, thanks to the fine chip makers at National Semiconductor. The
retransmission won't get through either, because the packet doesn't pass
the IP checksum, which, of course, it shouldn't.
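For anyone who wants to see why the stomped packet gets rejected, here is a
minimal sketch of the Internet checksum (the one's-complement sum described
in RFC 1071, which IP headers carry). It is not taken from the driver or
from the NetBSD network stack; inet_cksum and ip_header_ok are names I made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over a buffer treated as big-endian
 * 16-bit words (RFC 1071). Hypothetical helper, not driver code.
 */
static uint16_t inet_cksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)                        /* pad a trailing odd byte */
        sum += (uint32_t)buf[len - 1] << 8;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Returns 1 if a header whose checksum field is already filled in
 * verifies, 0 otherwise. A good header checksums to zero.
 */
static int ip_header_ok(const uint8_t *hdr, size_t hlen)
{
    assert(hlen >= 20);                 /* minimum IPv4 header size */
    return inet_cksum(hdr, hlen) == 0;
}
```

A valid header checksums to zero with its checksum field included. Change
any single byte and ip_header_ok() returns 0, so the receiver drops the
packet no matter how many times it is resent.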
The problem in the driver is that if it resets the card, due to the chip's
failure, it doesn't return a different status to the sending output
routine. Here's the relevant section of the driver.
/* $NetBSD: if_ed.c,v 1.100 1996/05/12 23:52:19 mycroft Exp $ */
/*
* Device driver for National Semiconductor DS8390/WD83C690 based ethernet
* adapters.
*
* Copyright (c) 1994, 1995 Charles M. Hannum. All rights reserved.
*
* Copyright (C) 1993, David Greenman. This software may be used, modified,
[snip, snip, snip]
        /*
         * Be fairly liberal about what we allow as a "reasonable"
         * length so that a [crufty] packet will make it to BPF (and
         * can thus be analyzed). Note that all that is really
         * important is that we have a length that will fit into one
         * mbuf cluster or less; the upper layer protocols can then
         * figure out the length from their own length field(s).
         */
        if (len <= MCLBYTES &&
            packet_hdr.next_packet >= sc->rec_page_start &&
            packet_hdr.next_packet < sc->rec_page_stop) {
                /* Go get packet. */
                edread(sc, packet_ptr + sizeof(struct ed_ring),
                    len - sizeof(struct ed_ring));
        } else {
                /* Really BAD. The ring pointers are corrupted. */
                log(LOG_ERR,
                    "%s: NIC memory corrupt - invalid packet length %d\n",
                    sc->sc_dev.dv_xname, len);
                ++sc->sc_arpcom.ac_if.if_ierrors;
                edreset(sc);
                return;
        }
[snip snip snip]
        /*
         * Wait for remote DMA complete. This is necessary because on the
         * transmit side, data is handled internally by the NIC in bursts and
         * we can't start another remote DMA until this one completes. Not
         * waiting causes really bad things to happen - like the NIC
         * irrecoverably jamming the ISA bus.
         */
        while (((NIC_GET(bc, ioh, nicbase, ED_P0_ISR) & ED_ISR_RDC) !=
            ED_ISR_RDC) && --maxwait);
        if (!maxwait) {
                log(LOG_WARNING,
                    "%s: remote transmit DMA failed to complete\n",
                    sc->sc_dev.dv_xname);
                edreset(sc);
        }
        return (len);
}
In either case, the card gets reset, and the packet gets stomped on. If
the packet is a TCP packet, the higher-level layers will try to resend this
mbuf of garbage, and your TCP session is toast.
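To make the concern concrete, here is a minimal sketch of the kind of status
propagation I have in mind. It is not the real if_ed code and I have not
tried it against the driver. struct fake_nic, fake_reset, xmit_copy,
start_output and XMIT_FAILED are all hypothetical names, and the dma_stuck
flag only simulates the remote DMA never completing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define XMIT_FAILED (-1)    /* hypothetical: packet did not reach the card */

struct fake_nic {
    unsigned char mem[1536];    /* stand-in for on-card transmit memory */
    int dma_stuck;              /* simulation knob: DMA never completes */
    int resets;                 /* number of card resets performed */
    int oerrors;                /* output errors counted by the caller */
};

/* Stand-in for a card reset: buffer contents are lost. */
static void fake_reset(struct fake_nic *nic)
{
    memset(nic->mem, 0, sizeof nic->mem);
    nic->dma_stuck = 0;
    nic->resets++;
}

/*
 * Copy a packet to the card. Unlike the excerpt above, a reset is
 * reported to the caller instead of returning len as if all were well.
 */
static int xmit_copy(struct fake_nic *nic, const unsigned char *pkt,
    size_t len)
{
    int maxwait = 100;

    assert(nic != NULL && pkt != NULL);
    if (len > sizeof nic->mem)
        return XMIT_FAILED;
    memcpy(nic->mem, pkt, len);
    while (nic->dma_stuck && --maxwait)
        ;                       /* same shape as the RDC wait loop */
    if (!maxwait) {
        fake_reset(nic);
        return XMIT_FAILED;
    }
    return (int)len;
}

/*
 * Caller side: returns 1 if the packet may be transmitted, 0 if it
 * must stay queued because the copy to the card failed.
 */
static int start_output(struct fake_nic *nic, const unsigned char *pkt,
    size_t len)
{
    if (xmit_copy(nic, pkt, len) == XMIT_FAILED) {
        nic->oerrors++;
        return 0;
    }
    return 1;
}
```

The point is only that the caller can tell a clean copy from one that ended
in a reset, and so can hold on to the packet instead of treating the card's
buffer as good.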
Is my assessment wrong here, or does the problem still exist because the
value that if_ed returns to the caller still says everything is OK? Or does
this driver not tickle the chips in the same way? Right now, I have two
Ethernet cards which respond the same way to the 0.9A driver, which is to
say, they get reset by the driver for misbehaving, and consequently step on
the outgoing data.
Any suggestions, pointers at the error in my logic, etc. would be
greatly appreciated.
-thanks
-Brian
BTW, I have a trace of a session getting hosed in this way. I also can
reproduce it at will.
-Brian