Subject: Re: Data corruption issues possibly involving cgd(4)
To: Roland Dowdeswell <elric@imrryr.org>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/16/2007 07:19:20

Hi Roland,

On Tue, Jan 16, 2007 at 01:08:32AM -0500, Roland Dowdeswell wrote:
> >The issue manifests as follows:
> >
> >1) Repeatedly hashing a large file residing on the crypted partition
> >   occasionally yields a bad checksum. The problem can be reproduced by
> >   repeatedly checking a large .rar file or .flac file as well.
> >   The file is large enough not to fit in RAM and disks are active 100% of
> >   the time.
> >   Sometimes the wrong hash occurs at the 5th run, sometimes 20 runs are
> >   needed. Sometimes two bad runs occur in succession.
> >2) same as 1) but the file fits into RAM so that subsequent hashes don't
> >   hit the disk: the problem does _not_ occur. Tested with over 2000 runs of
> >   md5 <file>.
> >3) same as 1) but the file resides on a non-cgd partition on a RAID1 using
> >   raid(4): the problem also does _not_ occur. I aborted the hashing after
> >   100 runs where the problem would have shown up with certainty in 1).
> >4) memtest86+ runs without errors.
> >5) mprime[1] runs without errors.
> >6) build.sh release not involving the cgd partition runs without errors.
> 
> Okay, so CGD does live under the buffer cache so (2) will not be
> causing any encryption to occur.

That's what I figured as well.
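
For anyone who wants to try reproducing 1) on other hardware, the test
amounts to roughly the following. This is only a minimal sketch in Python,
not the md5 <file> invocation from the list above; the function names and
the 1 MiB chunk size are made up for illustration. It also assumes the
first pass is good, which is not guaranteed, so a mismatch only says that
two reads of the same file disagreed.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read until EOF; chunking keeps memory use small for huge files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_hash(path, runs):
    """Hash `path` `runs` times and report passes that disagree with pass 1.

    Returns (reference_digest, mismatches), where mismatches is a list of
    (run_number, digest) tuples. Run numbers start at 1; run 1 is the
    reference and is assumed (not known) to be correct.
    """
    reference = md5_of(path)
    mismatches = []
    for run in range(2, runs + 1):
        digest = md5_of(path)
        if digest != reference:
            mismatches.append((run, digest))
    return reference, mismatches
```

Calling repeated_hash("/path/to/large/file", 100) returns the reference
digest and a list of the runs that differed. As in 1), the file has to be
larger than RAM, otherwise later passes are served from the buffer cache
and never touch the disk or cgd.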


> The only thing that I can think of might be that there are some
> kinds of memory errors that occurred a number of years ago under
> particular usage patterns, e.g. gcc, which memtest did not catch.
> Otherwise, it does seem that CGD might be the obvious culprit---but
> that said, there's nothing in the code path, I think, that is not
> deterministic.
> 
> Can you reproduce this issue on another system or is it just this
> one?

Transferring the system to other hardware will be a bit of a hassle, but I
will see to it.

Thanks and regards,

ND