Subject: Data corruption issues possibly involving cgd(4)
To: None <current-users@netbsd.org>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/16/2007 06:58:09
Hi there,

I am currently experiencing data corruption using 4.0_BETA2 from around
mid-December.

cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Unknown K7 (Athlon) (686-class), 2000.30 MHz, id 0x40fb2
cpu0: features ffdbfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features ffdbfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,NOX,MMXX,MMX>
cpu0: features ffdbfbff<FXSR,SSE,SSE2,B27,HTT,LONG,3DNOW2,3DNOW>
cpu0: features2 2001<SSE3>
cpu0: "AMD Athlon(tm) 64 X2 Dual Core Processor 3600+"
cpu0: I-cache 64 KB 64B/line 2-way, D-cache 64 KB 64B/line 2-way
cpu0: L2 cache 256 KB 64B/line 16-way
cpu0: ITLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu0: DTLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu0: AMD Power Management features: 3f<STC,TM,TTP,VID,FID,TS>
cpu0: AMD PowerNow! Technology 2000 MHz
cpu0: available frequencies (Mhz): 1000 1800 2000
cpu0: calibrating local timer
cpu0: apic clock running at 200 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: not started

ACPI is enabled. This is an Athlon 64 X2 3600+ EE on an ASRock ALiveSATA2-GLAN
with an MDT 512 MB stick of DDR2-800 RAM. The box has 5 drives in a RAID5 using
raid(4), with a cgd(4) on top of that.

The issue manifests as follows:

1) Repeatedly hashing a large file residing on the encrypted (cgd) partition
   occasionally yields a bad checksum. The problem can be reproduced by
   repeatedly checking a large .rar file or .flac file as well.
   The file is large enough not to fit in RAM and disks are active 100% of
   the time.
   Sometimes the wrong hash occurs at the 5th run, sometimes 20 runs are
   needed. Sometimes two bad runs occur in succession.
2) Same as 1), but the file fits into RAM so that subsequent hashes don't
   hit the disk: the problem does _not_ occur. Tested with over 2000 runs of
   md5 <file>.
3) Same as 1), but the file resides on a non-cgd partition on a RAID1 using
   raid(4): the problem also does _not_ occur. I aborted the hashing after
   100 runs where the problem would have shown up with certainty in 1).
4) memtest86+ runs without errors.
5) mprime[1] runs without errors.
6) build.sh release not involving the cgd partition runs without errors.
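
For anyone who wants to reproduce the test in 1) without md5(1), here is a
minimal sketch of the same loop in Python. The function names file_md5 and
find_bad_runs are made up for this example and are not part of any NetBSD
tool. The sketch assumes the first run yields the correct hash and treats
every later run that differs from it as a bad run.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_runs(path, runs):
    """Hash the file `runs` times in total.

    The first run is taken as the reference (an assumption: it could itself
    be the corrupted one). Returns (reference, bad), where bad is a list of
    (run_number, digest) for every later run that disagrees.
    """
    reference = file_md5(path)
    bad = []
    for run in range(2, runs + 1):
        current = file_md5(path)
        if current != reference:
            bad.append((run, current))
    return reference, bad
```

Calling find_bad_runs("/path/to/large/file", 20) returns the reference hash
and the list of runs that disagreed with it. As in 1), the file has to be
larger than RAM, otherwise later runs are served from the buffer cache and
never reach the disks. If nearly every run is reported as bad, the reference
run was probably the corrupted one.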

Since I noticed sys/arch/x86/x86/errata.c on HEAD, at first I thought the
CPU might be affected by one of the errata it checks for. So I tried booting
a -current GENERIC.MPACPI kernel using boot -d. This did not produce any
related output in dmesg.

SMP vs. UP kernel makes no difference.
Setting machdep.powernow.frequency.target to 1000 or 2000 makes no difference.

Judging from 2) to 6), I can exclude heating issues or a problem caused by
concurrent hash calculation and disk access. envstat reports the CPU
temperature below 40°C at all times.

Please help, I'm at a loss.

Best regards,

ND


[1] http://www.mersenne.org/freesoft.htm