current-users: Re: Data corruption issues possibly involving cgd(4)

Subject: Re: Data corruption issues possibly involving cgd(4)
To: Daniel Carosone <dan@geek.com.au>
From: Nino Dehne <ndehne@gmail.com>
List: current-users
Date: 01/16/2007 22:44:14

On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
> > > considerable numbers of seeks.  It is the seeks that cause the disks to
> > > draw current bursts from the psu - so don't discount that.
> > 
> > Good point. To account for that, I repeatedly cat'ed the test file on the
> > cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
> > in a loop. I used 64M instead of 256M because the disk thrashing was really
> > bad. I also set the CPU frequency to its maximum to maximize the power the
> > system draws.
> 
> a cpu-hog process would help here too..

While doing the above, the CPU is only about 0%-8% idle, so it is already
close to fully loaded. I'm still running a UP (uniprocessor) kernel.


> > I attribute the checksum change to changes on the filesystem, since that was
> > obviously mounted while doing the test. 
> 
> Probably, yeah; I gave some suggestions for ways to avoid this a
> moment ago, too.

I'll have a look. Your other mail only just arrived because of connectivity
problems earlier.


> > Getting over 70 equal checksums and then 3 equal other checksums in
> > a row with flaky hardware seems highly improbable to me.
> 
> Or the 64m is fitting in cache most of the time, and the bad read was
> cached and thus repeated?

Hashing from rcgd0d on its own keeps the disks 100% active, which suggests
the data is really being read from disk each time. As far as I know, reads
with dd from a raw device are not cached.
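
For reference, the repeated-hash test could be sketched roughly as below.
This is only an illustration, not the exact commands used in this thread:
the function names hash_prefix and repeat_hash are hypothetical, SHA-256
is an arbitrary choice of hash, and the block size is a parameter so that
the same region can be re-read with different read sizes.

```python
import hashlib
from collections import Counter


def hash_prefix(path, length, block_size=64 * 1024):
    """Return the SHA-256 hex digest of the first `length` bytes of `path`.

    Assumption: on a raw device, `length` and `block_size` should be
    multiples of the sector size, otherwise the reads may fail.
    """
    digest = hashlib.sha256()
    remaining = length
    # buffering=0 disables Python's own buffering only; it does not
    # control whether the operating system caches the data.
    with open(path, "rb", buffering=0) as dev:
        while remaining > 0:
            chunk = dev.read(min(block_size, remaining))
            if not chunk:  # end of file/device reached early
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def repeat_hash(path, length, runs, block_size=64 * 1024):
    """Hash the same region `runs` times and count each distinct digest."""
    return Counter(
        hash_prefix(path, length, block_size) for _ in range(runs)
    )


# Example (needs read access to the device; the path is an assumption):
# counts = repeat_hash("/dev/rcgd0d", 64 * 1024 * 1024, runs=70)
# More than one key in `counts` means at least one read differed.
```

More than one distinct digest only shows that some read returned different
data. It does not say whether the cause is the disk, the power supply, cgd
or a filesystem that changed underneath the test.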


> > i.e. mismatch at the 3rd run. I seriously doubt that the 70+ successful runs
> > on the rcgd0d device were pure luck.
> 
> Please try some of the other variants I suggested.  Perhaps try
> varying the block size of the dd, too.  If these eliminate seeking,
> then the next possible culprit is probably the filesystem :-/.

I'll do this right away.

Thanks and regards,

ND