current-users: Re: Data corruption issues possibly involving cgd(4)

Subject: Re: Data corruption issues possibly involving cgd(4)
To: Nino Dehne <ndehne@gmail.com>
From: Roland Dowdeswell <elric@imrryr.org>
List: current-users
Date: 01/16/2007 01:08:32
On 1168927089 seconds since the Beginning of the UNIX epoch
Nino Dehne wrote:
>
>Hi there,
>

>ACPI is enabled. This is an Athlon 64 X2 3600+ EE on an ASRock ALiveSATA2-GLAN
>and an MDT 512M stick of DDR2-800 RAM. The box has 5 drives in a RAID5 using
>raid(4) and a cgd(4) on top of that.

This is the correct layering.

>The issue manifests as follows:
>
>1) Repeatedly hashing a large file residing on the crypted partition
>   occasionally yields a bad checksum. The problem can be reproduced by
>   repeatedly checking a large .rar file or .flac file as well.
>   The file is large enough not to fit in RAM and disks are active 100% of
>   the time.
>   Sometimes the wrong hash occurs at the 5th run, sometimes 20 runs are
>   needed. Sometimes two bad runs occur in succession.
>2) same as 1) but the file fits into RAM so that subsequent hashes don't
>   hit the disk: the problem does _not_ occur. Tested with over 2000 runs of
>   md5 <file>.
>3) same as 1) but the file resides on a non-cgd partition on a RAID1 using
>   raid(4): the problem also does _not_ occur. I aborted the hashing after
>   100 runs where the problem would have shown up with certainty in 1).
>4) memtest86+ runs without errors.
>5) mprime[1] runs without errors.
>6) build.sh release not involving the cgd partition runs without errors.

Okay, so CGD sits below the buffer cache, which means that in (2) the
reads are served from the cache and never reach CGD, so no decryption
takes place.

The only thing that I can think of is memory: a number of years ago
there were some kinds of memory errors that showed up only under
particular usage patterns, e.g. gcc, and that memtest did not catch.
Otherwise, CGD does seem to be the obvious culprit, but that said,
I don't think there is anything in the code path that is not
deterministic.

Can you reproduce this issue on another system or is it just this
one?
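
If it helps to narrow things down, here is a rough sketch of one way to
see where the bad data lands, rather than only that the whole-file hash
changed. It is not taken from the report: the function names and the
64 KiB chunk size are assumptions made for illustration. It hashes the
file chunk by chunk on every pass and records the byte offsets whose
digest differs from the first pass.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # assumed granularity; not taken from the report


def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return a list of MD5 hex digests, one per chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def differing_offsets(reference, current, chunk_size=CHUNK_SIZE):
    """Return byte offsets of chunks whose digest differs from the reference."""
    offsets = [i * chunk_size
               for i, (a, b) in enumerate(zip(reference, current))
               if a != b]
    # A length mismatch means one pass saw a shorter file; flag where it ends.
    if len(reference) != len(current):
        offsets.append(min(len(reference), len(current)) * chunk_size)
    return offsets


def compare_runs(path, runs=20, chunk_size=CHUNK_SIZE):
    """Read the file `runs` times; map run number -> offsets that changed."""
    reference = chunk_digests(path, chunk_size)  # run 1 is the baseline
    bad = {}
    for run in range(2, runs + 1):
        current = chunk_digests(path, chunk_size)
        diff = differing_offsets(reference, current, chunk_size)
        if diff:
            bad[run] = diff
    return bad
```

As in (1), the file has to be larger than RAM so that each pass really
goes to disk. Note that the first pass is taken as the baseline, so if
that pass happens to be the bad one, every later run will report the
same offsets. Whether the offsets turn out to be aligned to sectors or
scattered might hint at which layer is at fault.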

>Since I noticed sys/arch/x86/x86/errata.c on HEAD, at first I thought the
>CPU might be affected by it. So I tried booting a -current GENERIC.MPACPI
>kernel using boot -d. This did not give anything in dmesg.
>
>SMP vs. UP kernel makes no difference.
>Setting machdep.powernow.frequency.target to 1000 or 2000 makes no difference.
>
>Judging from 2) to 6) I can exclude heating issues or something related
>to concurrent hash calculations and disk access. envstat reports CPU below
>40°C at all times.
>
>Please help, I'm at a loss.
>
>Best regards,
>
>ND
>
>
>[1] http://www.mersenne.org/freesoft.htm
>

--
    Roland Dowdeswell                      http://www.Imrryr.ORG/~elric/