Subject: Re: Data corruption issues possibly involving cgd(4)
To: Roland Dowdeswell <email@example.com>
From: Nino Dehne <firstname.lastname@example.org>
Date: 01/16/2007 07:19:20
On Tue, Jan 16, 2007 at 01:08:32AM -0500, Roland Dowdeswell wrote:
> >The issue manifests as follows:
> >1) Repeatedly hashing a large file residing on the crypted partition
> > occasionally yields a bad checksum. The problem can be reproduced by
> > repeatedly checking a large .rar file or .flac file as well.
> > The file is large enough not to fit in RAM and disks are active 100% of
> > the time.
> > Sometimes the wrong hash occurs at the 5th run, sometimes 20 runs are
> > needed. Sometimes two bad runs occur in succession.
> >2) same as 1) but the file fits into RAM so that subsequent hashes don't
> > hit the disk: the problem does _not_ occur. Tested with over 2000 runs of
> > md5 <file>.
> >3) same as 1) but the file resides on a non-cgd partition on a RAID1 using
> > raid(4): the problem also does _not_ occur. I aborted the hashing after
> > 100 runs where the problem would have shown up with certainty in 1).
> >4) memtest86+ runs without errors.
> >5) mprime runs without errors.
> >6) build.sh release not involving the cgd partition runs without errors.
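For anyone who wants to try reproducing 1), the test loop can be sketched roughly as follows. This is only a minimal illustration in Python, not the exact procedure I used (that was plain md5 <file> run repeatedly); the names md5_of and repeated_hash are made up for this example. It assumes the file is larger than RAM, so that every run really reads from disk.

```python
import hashlib
from collections import Counter


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so the file never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_hash(path, runs):
    """Hash the same file `runs` times and count each distinct digest.

    On a healthy system the result has exactly one key. More than one
    key means at least one run read different data from the same file.
    """
    return Counter(md5_of(path) for _ in range(runs))
```

Calling repeated_hash(path, 20) on a file on the cgd partition and finding more than one digest in the result would correspond to the bad runs described in 1). The script does nothing to defeat the buffer cache, so with a file that fits into RAM it only exercises case 2).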
> Okay, so CGD does live under the buffer cache so (2) will not be
> causing any encryption to occur.
That's what I figured as well.
> The only thing that I can think of might be that there are some
> kinds of memory errors that occurred a number of years ago under
> particular usage patterns, e.g. gcc, which memtest did not catch.
> Otherwise, it does seem that CGD might be the obvious culprit---but
> that said, there's nothing in the code path, I think, that is not
> Can you reproduce this issue on another system or is it just this
Transferring the system to other hardware will be a bit of a hassle. I will
see to it.
Thanks and regards,