Subject: Re: TCP checksum not good enough?
To: None <netbsd-users@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: netbsd-users
Date: 08/02/2006 16:40:40
On Wed, Aug 02, 2006 at 04:34:11PM -0400, Charles M. Hannum wrote:
> On Wed, Aug 02, 2006 at 01:20:38PM -0700, Andy Ruhl wrote:
> > A very large database was being backed up over a TCP/IP network, and
> > the restore of it to a test system would often be corrupt. This
> > prompted some very heated conversations, let's say.
> ...
> > Which means this data passed through whatever hardware checking was
> > done (not known to me exactly what, if any, there is) AND TCP
> > checksumming.
>
> Ethernet uses a stronger checksum than TCP (32 bits vs. 16). If you're
> not also seeing errors on the interface and/or in your TCP stats, then
> the problem is most likely occurring on one of the hosts.  Run a memory
> test.

We had a problem a few years ago on one of the project's servers where
a bad PCI bus bridge was corrupting data. The symptom was that disk
blocks from the controller on that bus would _very occasionally_ be
scrambled. But we couldn't tell whether the problem was the disk
controller or the bus -- memory tests would run without incident for
days. As a test, we turned on hardware TCP checksum offload on the
Ethernet controller on the same bus, and, surprise surprise, we'd see
corruption in received network packets -- because the controller
checked the checksum, which was correct, *before* the data came across
the bus, which scrambled them.
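
To illustrate the ordering problem described above, here is a minimal
Python sketch. It is not from the original thread. The flaky_bus()
function is a hypothetical stand-in for the bad bridge, and the checksum
is the plain 16-bit ones' complement sum in the style of RFC 1071, without
the TCP pseudo-header. It only shows that a check done before the bus
crossing passes, while the same check done on the host afterwards fails.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def flaky_bus(data: bytes) -> bytes:
    """Hypothetical bad bridge: flips one bit of the data in transit."""
    damaged = bytearray(data)
    damaged[len(damaged) // 2] ^= 0x04
    return bytes(damaged)


payload = b"database block 0042: some rows of data"
sent_checksum = internet_checksum(payload)

# 1. Offload case: the controller verifies the data before the bus.
nic_says_ok = internet_checksum(payload) == sent_checksum

# 2. The data then crosses the bus and is damaged on its way to memory.
in_memory = flaky_bus(payload)

# 3. Software check on the host, done after the bus crossing.
host_says_ok = internet_checksum(in_memory) == sent_checksum

print("controller check passed:", nic_says_ok)
print("host check passed:      ", host_says_ok)
```

With offload enabled only step 1 runs, so the damaged copy in memory is
handed to the application with no error counted anywhere.
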
In other words, Charles' advice is good advice but there are a lot of
places that data could get scrambled that a memory test may not find.
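
One way to catch this class of corruption, wherever it happens, is an
end-to-end check at the application level: hash the backup on the source
host, hash the restored copy on the destination, and compare. The sketch
below is an assumption about how that could be done, not something
described in the thread. The name file_digest() and the choice of SHA-256
are illustrative.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large backups need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Run it on both ends and compare the two strings. A mismatch does not say
which component scrambled the data, only that something between the two
reads did.
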
--
Thor Lancelot Simon tls@rek.tjls.com
"We cannot usually in social life pursue a single value or a single moral
aim, untroubled by the need to compromise with others." - H.L.A. Hart