Subject: Re: Data corruption issues possibly involving cgd(4)
To: Nino Dehne <ndehne@gmail.com>
From: Thilo Jeremias <jeremias@optushome.com.au>
List: current-users
Date: 01/17/2007 23:30:55
Is the changed checksum always the same, i.e. deterministic?
In other words, is this a systematic error, or is it a different
checksum each time (are there more than two distinct checksums)? The
latter is what I would expect from drive, cable, or power problems.
If it is deterministic, it probably happens at one particular block, so
isolating the location of the fault might help to find the cause.
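
One way to narrow this down is to hash the device in fixed-size chunks
on every pass and compare the per-chunk digests, instead of comparing a
single checksum over the whole 64M. The following is only a rough
sketch: the function names, the 1 MiB chunk size, and the choice of
SHA-1 are my own assumptions and not anything that was used in this
thread.

```python
import hashlib


def chunk_digests(path, chunk_size=1 << 20, limit=64 << 20):
    """Return one SHA-1 hex digest per chunk for the first `limit` bytes.

    chunk_size and limit are illustrative defaults (1 MiB chunks over
    the first 64 MiB); keep chunk_size a multiple of the sector size
    when reading from a raw device.
    """
    digests = []
    remaining = limit
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(chunk_size, remaining))
            if not data:
                break  # reached end of file/device before `limit`
            digests.append(hashlib.sha1(data).hexdigest())
            remaining -= len(data)
    return digests


def differing_chunks(reference, other):
    """Return the indices of chunks whose digests differ between two passes."""
    diffs = [i for i, (a, b) in enumerate(zip(reference, other)) if a != b]
    # Chunks present in only one of the two passes also count as differences.
    shorter, longer = sorted((len(reference), len(other)))
    diffs.extend(range(shorter, longer))
    return diffs


def distinct_results(passes):
    """Count how many different outcomes were seen across all passes."""
    return len({tuple(p) for p in passes})
```

Running chunk_digests() on the raw device (rcgd0d in this thread) in a
loop and feeding each pass to differing_chunks() against the first pass
would show whether a mismatch always lands in the same chunk.
distinct_results() answers the "more than two checksums" question: if
it keeps growing, the errors are probably random rather than systematic.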
--
my 5 cts
good luck
thilo
Nino Dehne wrote:
> On Wed, Jan 17, 2007 at 08:32:50AM +1100, Daniel Carosone wrote:
>
>> On Tue, Jan 16, 2007 at 08:49:14PM +0100, Nino Dehne wrote:
>>
>>>> considerable numbers of seeks. It is the seeks that cause the disks to
>>>> draw current bursts from the psu - so don't discount that.
>>>>
> >>> Good point. To account for that, I repeatedly cat'ed the test file on the
>>> cgd partition to /dev/null. At the same time, I hashed the first 64M of rcgd0d
>>> in a loop. I used 64M instead of 256M because the disk thrashing was really
>>> bad. I also set the CPU frequency to its maximum to maximize the power the
>>> system draws.
>>>
>> a cpu-hog process would help here too..
>>
>
> While doing the above, the CPU is about 0%-8% idle. I'm still running a
> UP kernel.
>
>
>
>>> I attribute the checksum change to changes on the filesystem, since that was
>>> obviously mounted while doing the test.
>>>
>> Probably, yeah; I gave some suggestions for ways to avoid this a
>> moment ago, too.
>>
>
> I'll have a look. Your other mail just arrived due to connectivity problems
> earlier.
>
>
>
>>> Getting over 70 equal checksums and then 3 equal other checksums in
>>> a row with flaky hardware seems highly improbable to me.
>>>
>> Or the 64m is fitting in cache most of the time, and the bad read was
>> cached and thus repeated?
>>
>
> Just doing the hashing from rcgd0d leaves the disks active 100%. I think
> dd from a raw device is not cached.
>
>
>
>>> i.e. mismatch at the 3rd run. I seriously doubt that the 70+ successful runs
>>> on the rcgd0d device were pure luck.
>>>
>> Please try some of the other variants I suggested. Perhaps try
>> varying the block size of the dd, too. If these eliminate seeking,
>> then the next possible culprit is probably the filesystem :-/.
>>
>
> Gonna do this right away.
>
> Thanks and regards,
>
> ND
>