Subject: Re: Data corruption issues possibly involving cgd(4)
To: David Laight <david@l8s.co.uk>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/16/2007 20:49:14
On Tue, Jan 16, 2007 at 09:28:21AM +0000, David Laight wrote:
> On Tue, Jan 16, 2007 at 08:00:14AM +0100, Nino Dehne wrote:
> > 
> > After 50 runs of dd if=/dev/rcgd0d bs=65536 count=4096 | md5 and no error
> > I aborted the test. Replacing rcgd0d with cgd0a made no difference.
> > While not necessary IMO, I tried the same with rraid1d, no errors either
> > after 50 runs. For comparison, a loop on the filesystem on the cgd aborted
> > after the 14th run now.
> > 
> > So the issue doesn't seem to be related to the power supply either and
> > frankly, it's starting to freak me out.
> 
> The 'dd' will be doing sequential reads, whereas the fs version will be doing
> considerable numbers of seeks.  It is the seeks that cause the disks to
> draw current bursts from the psu - so don't discount that.

Good point. To account for that, I repeatedly cat'ed the test file on the
cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
in a loop. I used 64M instead of 256M because the disk thrashing was really
bad. I also set the CPU frequency to its maximum to maximize the power the
system draws.
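
For reference, a minimal sketch of that kind of combined test. This is not
the exact command line used: the helper names hash_stdin and hash_loop and
the path /path/to/testfile are placeholders made up for illustration, and
the md5sum fallback is only there for systems without md5(1).

```sh
#!/bin/sh
# Sketch only: hash_stdin, hash_loop and /path/to/testfile are hypothetical.

# Print the MD5 of stdin. Uses md5(1) where present, otherwise md5sum(1).
hash_stdin() {
    if command -v md5 >/dev/null 2>&1; then
        md5
    else
        md5sum | cut -d ' ' -f 1
    fi
}

# Hash the first $2 blocks of 64k from $1, $3 times, one checksum per line.
hash_loop() {
    dev=$1
    blocks=$2
    runs=$3
    i=0
    while [ "$i" -lt "$runs" ]; do
        dd if="$dev" bs=65536 count="$blocks" 2>/dev/null | hash_stdin
        i=$((i + 1))
    done
}

# Usage (not executed here):
#   ( while :; do cat /path/to/testfile > /dev/null; done ) &  # seek load
#   loader=$!
#   hash_loop /dev/rcgd0d 1024 100     # 1024 * 64k = 64M per run
#   kill "$loader"
```

The background loop keeps the heads seeking on the filesystem while the
foreground loop reads the same 64M of the raw device over and over.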

The results were as follows:

f7abc41f7514946306a6aeddca8cb704
[about 70 occurrences of the same checksum]
f7abc41f7514946306a6aeddca8cb704
102e34f6c25d4fc135da03e26d4feff0
cfc1dc011ccff0e82fc5aa5a69173bd0
cfc1dc011ccff0e82fc5aa5a69173bd0
cfc1dc011ccff0e82fc5aa5a69173bd0

I attribute the checksum change to changes on the filesystem, since that was
obviously mounted while doing the test. Getting over 70 identical checksums
and then 3 more identical ones in a row with flaky hardware seems highly
improbable to me.

In comparison, a loop of hashes on the file itself afterwards gave the
following result:

82d964b8d0cd2f60041067fc9263c1d7
82d964b8d0cd2f60041067fc9263c1d7
686d81e7362114475427b7fff2aec4fb
82d964b8d0cd2f60041067fc9263c1d7
82d964b8d0cd2f60041067fc9263c1d7
82d964b8d0cd2f60041067fc9263c1d7
82d964b8d0cd2f60041067fc9263c1d7

i.e. a mismatch on the 3rd run. I seriously doubt that the 70+ successful runs
on the rcgd0d device were pure luck.
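
A loop like the one on the file can also flag the odd run out by itself.
The following is only a sketch: check_file and hash_stdin are made-up helper
names, and it assumes the first checksum is the good one, which is of course
not guaranteed on a system that corrupts reads.

```sh
#!/bin/sh
# Sketch only: check_file and hash_stdin are hypothetical helper names.

# Print the MD5 of stdin. Uses md5(1) where present, otherwise md5sum(1).
hash_stdin() {
    if command -v md5 >/dev/null 2>&1; then
        md5
    else
        md5sum | cut -d ' ' -f 1
    fi
}

# Hash file $1 $2 times. Print every checksum and mark each run whose
# checksum differs from the first one (assumed to be the reference).
# Returns non-zero if any run mismatched.
check_file() {
    file=$1
    runs=$2
    first=
    bad=0
    i=1
    while [ "$i" -le "$runs" ]; do
        sum=$(hash_stdin < "$file")
        [ -n "$first" ] || first=$sum
        if [ "$sum" = "$first" ]; then
            echo "$sum"
        else
            echo "$sum  <- mismatch at run $i"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    [ "$bad" -eq 0 ]
}

# Usage (not executed here):
#   check_file /path/to/testfile 50
```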

Please, anyone. :(

Best regards,

ND