current-users: Re: Data corruption issues possibly involving cgd(4)

Subject: Re: Data corruption issues possibly involving cgd(4)
To: Nino Dehne <ndehne@gmail.com>
From: Thilo Jeremias <jeremias@optushome.com.au>
List: current-users
Date: 01/17/2007 23:30:55

Is the changed checksum always deterministically the same?
That is, is this a systematic error, or is it a different checksum
every time (i.e. are there more than two distinct checksums)? In the
latter case I would suspect drive, cable, power or similar problems.
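
To make that question concrete, here is a rough sketch of how the
checksums collected from repeated runs could be tallied. The function
names and the "at most two distinct values" rule are my own assumptions
for illustration, not an established test:

```python
from collections import Counter


def summarize_checksums(checksums):
    """Count how often each distinct checksum appeared across runs.

    Returns (checksum, count) pairs, most frequent first.
    """
    return Counter(checksums).most_common()


def looks_systematic(checksums):
    """Heuristic only (an assumption, not a proof): at most two distinct
    values suggests one 'good' result and one repeatable 'bad' result.
    Many distinct values point more towards random corruption."""
    return len(set(checksums)) <= 2
```

Feeding it the list of hashes from the test loop would show at a glance
whether the bad value repeats or changes from run to run.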

If it is deterministic, it probably always happens at a certain block,
so it might help to isolate the location of the fault in order to find
the cause.
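
One possible way to narrow down the location is to hash the data in
fixed-size chunks on two separate passes and compare the per-chunk
digests. This is only a sketch: the helper names, the 1 MiB chunk size
and the choice of SHA-1 are my assumptions, and on a raw device the
read size presumably has to be a multiple of the sector size:

```python
import hashlib


def chunk_digests(path, chunk_size=1024 * 1024, limit=None):
    """Return one SHA-1 hex digest per chunk_size bytes read from path.

    limit, if given, caps the total number of bytes read.
    """
    digests = []
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            data = f.read(want)
            if not data:
                break  # end of file or device
            digests.append(hashlib.sha1(data).hexdigest())
            if remaining is not None:
                remaining -= len(data)
    return digests


def differing_chunks(first_pass, second_pass):
    """Indices of chunks whose digests differ between two passes."""
    return [i for i, (a, b) in enumerate(zip(first_pass, second_pass)) if a != b]
```

A chunk index multiplied by chunk_size gives the byte offset where the
mismatch starts. If the same index turns up on every bad run, that
points to a fixed location rather than a random fault.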


-- 
my 5 cts

good luck

thilo


Nino Dehne wrote:
> On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
>   
>> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>     
>>>> considerable numbers of seeks.  It is the seeks that cause the disks to
>>>> draw current bursts from the psu - so don't discount that.
>>>>         
>>> Good point. To account for that, I repeatedly cat'ed the test file on the
>>> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
>>> in a loop. I used 64M instead of 256M because the disk thrashing was really
>>> bad. I also set the CPU frequency to its maximum to maximize the power the
>>> system draws.
>>>       
>> a cpu-hog process would help here too..
>>     
>
> While doing the above, the CPU is about 0%-8% idle. I'm still running a
> UP kernel.
>
>
>   
>>> I attribute the checksum change to changes on the filesystem, since that was
>>> obviously mounted while doing the test. 
>>>       
>> Probably, yeah; I gave some suggestions for ways to avoid this a
>> moment ago, too.
>>     
>
> I'll have a look. Your other mail just arrived due to connectivity problems
> earlier.
>
>
>   
>>> Getting over 70 equal checksums and then 3 equal other checksums in
>>> a row with flaky hardware seems highly improbable to me.
>>>       
>> Or the 64m is fitting in cache most of the time, and the bad read was
>> cached and thus repeated?
>>     
>
> Just doing the hashing from rcgd0d leaves the disks active 100%. I think
> dd from a raw device is not cached.
>
>
>   
>>> i.e. mismatch at the 3rd run. I seriously doubt that the 70+ successful runs
>>> on the rcgd0d device were pure luck.
>>>       
>> Please try some of the other variants I suggested.  Perhaps try
>> varying the block size of the dd, too.  If these eliminate seeking,
>> then the next possible culprit is probably the filesystem :-/.
>>     
>
> Gonna do this right away.
>
> Thanks and regards,
>
> ND
>