Subject: Re: bridge(4) and silent data corruption :-(
To: Sean Doran <smd@ab.use.net>
From: Dennis Ferguson <dennis@juniper.net>
List: tech-net
Date: 05/01/2002 12:55:00
Sean,

>   I'm still not sure whether it's the bridge itself, or
> whether somehow the router knows there's a bridge and
> is doing something toxic.   (I can't really imagine how it would
> detect a simple learning bridge though, particularly if
> there's a hub or switch in the way).

I probably wasn't clear enough.  Here's an entirely hypothetical theory
which matches your symptoms.

The ethernet adaptor in your router has a problem which occasionally makes
it mangle packets.  The particular way the packet is mangled is that chunks
of the packet are reordered and the CRC is incorrect.  No one notices the
problem because it doesn't happen often and because the mangled packets
are normally dropped by the receiver due to the CRC.  The mangling
does, however, leave the TCP/UDP checksum intact because the packet data is
reordered but otherwise unchanged.
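
To make the checksum argument concrete, here is a minimal Python sketch. It
is an illustration only, not taken from any real driver or router: the
helper names (internet_checksum, swap_chunks) and the 64-byte payload are
made up, zlib.crc32 stands in for the Ethernet CRC, and the checksum is
computed over the payload alone (a real TCP checksum also covers the header
and a pseudo-header). It assumes the mangling swaps chunks that start at
even offsets and have even length.

```python
import zlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def swap_chunks(data: bytes, a: int, b: int, size: int) -> bytes:
    """Hypothetical mangling: exchange two equal-sized, non-overlapping chunks."""
    out = bytearray(data)
    out[a:a + size], out[b:b + size] = data[b:b + size], data[a:a + size]
    return bytes(out)

payload = bytes(range(64))                    # made-up packet data
mangled = swap_chunks(payload, 16, 32, 16)    # reorder two 16-byte chunks

print("data changed:       ", mangled != payload)
print("checksum unchanged: ", internet_checksum(mangled) == internet_checksum(payload))
print("CRC changed:        ", zlib.crc32(mangled) != zlib.crc32(payload))
```

The checksum is a sum of 16-bit words, so the order of the words doesn't
matter; the CRC depends on bit positions, so it does.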

The reason this is a problem through the bridge is that putting the
interfaces on the NetBSD box in promiscuous mode causes packet CRC errors
to be ignored (I have no idea if this is so, but I do know that allowing
tcpdump and other listeners to receive errored packets is useful and this
was the only application promiscuous mode was used for prior to the bridge
code).  That means packets through the NetBSD bridge are received with the
bad CRC but are forwarded out the other interface with a good CRC being
regenerated.

If this were the case then connections not using the router would be okay
(the router mangles the packets), and connections using the router but not
crossing the bridge would be okay (packets which are mangled would be
dropped with a bad CRC).  Only packets passing through both the router and
the bridge would be a problem.  I think this matches your symptoms without
the router having to know about a bridge it can't see and vice versa.  I
think there are other ways to end up with the same problem too; this is
just the first one that came to mind.
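
The three cases above can be played out in a toy model. This is a sketch of
the hypothetical theory only, not of how the NetBSD bridge code or any real
router behaves: every function name here is invented, a "frame" is just a
(payload, CRC) pair, and the router is made to mangle every packet rather
than an occasional one.

```python
import zlib
from typing import Optional, Tuple

Frame = Tuple[bytes, int]   # (payload, CRC as carried on the wire)

def send(payload: bytes) -> Frame:
    """A healthy sender: the CRC matches the payload."""
    return (payload, zlib.crc32(payload))

def mangling_router(frame: Frame) -> Frame:
    """Hypothetical fault: chunks get reordered, so the CRC no longer matches."""
    payload, crc = frame
    return (payload[16:32] + payload[:16] + payload[32:], crc)

def normal_receiver(frame: Frame) -> Optional[bytes]:
    """An ordinary interface: frames with a bad CRC are dropped."""
    payload, crc = frame
    return payload if zlib.crc32(payload) == crc else None

def promiscuous_bridge(frame: Frame) -> Frame:
    """Assumed bridge behaviour: CRC errors ignored, fresh CRC on output."""
    payload, _ = frame
    return (payload, zlib.crc32(payload))

data = bytes(range(64))   # made-up packet data

no_router = normal_receiver(send(data))
router_only = normal_receiver(mangling_router(send(data)))
router_and_bridge = normal_receiver(promiscuous_bridge(mangling_router(send(data))))

print("no router:        ", "intact" if no_router == data else "problem")
print("router only:      ", "dropped" if router_only is None else "delivered")
print("router and bridge:", "corrupt data delivered"
      if router_and_bridge not in (None, data) else "ok")
```

In the router-only case the drop is harmless because TCP retransmits; only
the last path hands damaged data to the receiver with a valid-looking CRC.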

>   You don't have to sell me on the evils of ATM :) but I don't
> see how it can be blamed, since on the "outside" of the bridge an
> scp2 file transfer through the router will complete fine, and on
> the "inside" it will fail.  I didn't see anything in the code walk
> I did that distinguishes between local-to-this-LAN traffic
> and delivered-beyond-this-LAN stuff, although I didn't get to
> look for arp magic and the like.

Actually I only brought up ATM since I am interested in the ways hardware
and software can fail causing damage to packets that the TCP checksum won't
catch, and having hardware which is overly attached to ATM makes this easier.
Cheap DSL routers can have this property.  Note that, whatever the cause,
you are getting packets which pass their TCP checksum but fail the
cryptographic authentication checksum, so if there is no NAT box that
gratuitously screws with packet contents in the path anywhere then this has
the makings of being a very interesting failure if you can figure out what
it is.
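
A rough sketch of why ssh notices what TCP does not, under the same
hypothetical reordering as in my theory. The key, the payload, the helper
names and the choice of HMAC-SHA256 are all assumptions made for the
example (your ssh will have negotiated its own MAC), and the checksum is
again computed over the payload only.

```python
import hashlib
import hmac

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def mac(key: bytes, data: bytes) -> bytes:
    """Keyed cryptographic checksum; HMAC-SHA256 is an illustrative choice."""
    return hmac.new(key, data, hashlib.sha256).digest()

key = b"hypothetical session key"
sent = bytes(range(64))                           # made-up packet data
received = sent[:16] + sent[32:48] + sent[16:32] + sent[48:]   # chunks swapped

tcp_ok = internet_checksum(received) == internet_checksum(sent)
mac_ok = hmac.compare_digest(mac(key, received), mac(key, sent))

print("TCP checksum passes:", tcp_ok)
print("MAC passes:         ", mac_ok)
```

Take the MAC away and the only check left is the one the reordering slips
past.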

> | (I know this is stretching a bit).
> 
>   Well, I'm relieved you're stumped too. :-)

I can still think up theories that match the symptoms.  Theories are
cheap, however, and the real cause will be way more interesting than
those if you can figure it out.

Note that you'll want to be careful with this arrangement since the failures
you get running with ssh will turn into undetected data corruption without
ssh.

Dennis Ferguson