Subject: Re: TCP checksum not good enough?
To: None <netbsd-users@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: netbsd-users
Date: 08/02/2006 16:40:40
On Wed, Aug 02, 2006 at 04:34:11PM -0400, Charles M. Hannum wrote:
> On Wed, Aug 02, 2006 at 01:20:38PM -0700, Andy Ruhl wrote:
> > A very large database was being backed up over a TCP/IP network, and
> > the restore of it to a test system would often be corrupt. This
> > prompted some very heated conversations, let's say.
> ...
> > Which means this data passed through whatever hardware checking was
> > done (not known to me exactly what, if any, there is) AND TCP
> > checksumming.
> 
> Ethernet uses a stronger checksum than TCP (32 bits vs. 16).  If you're
> not also seeing errors on the interface and/or in your TCP stats, then
> the problem is most likely occurring on one of the hosts.  Run a memory
> test.

We had a problem a few years ago on one of the project's servers where
a bad PCI bus bridge was corrupting data.  The symptom was that disk
blocks from the controller on that bus would _very occasionally_ be
scrambled.  But we couldn't tell whether the problem was the disk
controller or the bus -- memory tests would run without incident for
days.  As a test, we turned on hardware TCP checksum offload on the
Ethernet controller on the same bus, and, surprise surprise, we'd see
corruption in received network packets -- because the controller
verified the checksum, which was correct, *before* the data crossed
the bus, which then scrambled it.
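
To make the ordering problem concrete, here is a rough Python sketch of
that failure mode.  It is only an illustration: nic_receive() and
flaky_bus() are hypothetical stand-ins for the controller's offloaded
check and the bad bridge, not anything from a real driver.  The checksum
routine is the usual 16-bit ones' complement sum described in RFC 1071.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def nic_receive(payload: bytes, checksum: int) -> bool:
    """Hypothetical NIC: verifies the checksum while data is still on the card."""
    return internet_checksum(payload) == checksum

def flaky_bus(payload: bytes, bit: int = 0) -> bytes:
    """Hypothetical bad bridge: flips one bit on the way to host memory."""
    buf = bytearray(payload)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

sent = b"database block 0001"
wire_checksum = internet_checksum(sent)

nic_ok = nic_receive(sent, wire_checksum)    # check happens on the card
in_memory = flaky_bus(sent, bit=5)           # corruption happens afterwards

# If the host had computed the checksum itself, over what actually
# landed in memory, it would have noticed the damage.
host_ok = internet_checksum(in_memory) == wire_checksum

print("NIC says checksum good:", nic_ok)
print("data intact in memory: ", in_memory == sent)
print("host-side check passes:", host_ok)
```

With offload on, the stack trusts the card's verdict and never looks at
the bytes that arrived in memory; with offload off, the host computes
the sum after the bus crossing and so has a chance to catch it.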

In other words, Charles' advice is good, but there are a lot of places
where data can get scrambled that a memory test may not find.
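
One way to catch corruption regardless of where it happens is an
end-to-end check: hash the data on the source host, hash it again after
the restore, and compare.  The sketch below assumes SHA-256 and uses
in-memory buffers as stand-ins for the dump and the restored copy; the
name stream_digest() is made up for this example.

```python
import hashlib
import io

def stream_digest(stream, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

# Stand-ins for the database dump and two restored copies.
original = b"very large database dump" * 1000
restored_good = bytes(original)
restored_bad = bytearray(original)
restored_bad[12345] ^= 0x01                  # one flipped bit somewhere en route

source_digest = stream_digest(io.BytesIO(original))
good_matches = stream_digest(io.BytesIO(restored_good)) == source_digest
bad_matches = stream_digest(io.BytesIO(bytes(restored_bad))) == source_digest

print("clean restore matches:  ", good_matches)
print("damaged restore matches:", bad_matches)
```

A mismatch will not say whether the network, the bus, the disk, or
memory is at fault, but it does tell you the restore is bad before
anyone relies on it.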

-- 
  Thor Lancelot Simon	                                     tls@rek.tjls.com

  "We cannot usually in social life pursue a single value or a single moral
   aim, untroubled by the need to compromise with others."      - H.L.A. Hart