Subject: Re: Data corruption issues possibly involving cgd(4)
To: Nino Dehne , Daniel Carosone <dan@geek.com.au>
From: Daniel Carosone <dan@geek.com.au>
List: current-users
Date: 01/17/2007 07:44:33

On Tue, Jan 16, 2007 at 08:00:14AM +0100, Nino Dehne wrote:
> After 50 runs of dd if=/dev/rcgd0d bs=65536 count=4096 | md5 and no error
> I aborted the test. Replacing rcgd0d with cgd0a made no difference.

Interesting.

> While not necessary IMO, I tried the same with rraid1d, no errors either
> after 50 runs.

Which was actually the test I had in mind, to eliminate cgd(4) but
keep the rest the same.  Your test above suggests the problem lies
somewhere other than cgd alone, which is good.

> For comparison, a loop on the filesystem on the cgd aborted
> after the 14th run now.
>
> So the issue doesn't seem to be related to the power supply either and
> frankly, it's starting to freak me out.

I sympathise, but this is progress.  You've already done a number of
important tests that eliminate certain causes, and now we're
eliminating more and closing in on the culprit.

On Tue, Jan 16, 2007 at 09:28:21AM +0000, David Laight wrote:
> The 'dd' will be doing sequential reads, whereas the fs version will be doing
> considerable numbers of seeks.  It is the seeks that cause the disks to
> draw current bursts from the psu - so don't discount that.

And this is a most excellent and important point.  Could you try
repeating the test with one or more of these variations to force
seeking (rough commands sketched below):

 two concurrent dd's, one with a large skip= to land elsewhere on the
 platters

 dd from raid and a concurrent fsck -n of the cgd filesystem

 multiple concurrent fsck -n's, to see if they ever report different
 errors.  -n is especially important here, both because of the
 concurrency and in case they turn up spurious errors
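
Roughly, something like the following (an untested sketch: the skip=
value is only illustrative, so pick one that lands well away on the
platters, and I'm assuming the filesystem sits on cgd0a as in your
earlier test):

  # two sequential readers at widely separated offsets, forcing the
  # heads to seek back and forth between them
  dd if=/dev/rraid1d bs=65536 count=4096 | md5 &
  dd if=/dev/rraid1d bs=65536 skip=131072 count=4096 | md5 &
  wait

  # a sequential read racing a read-only fsck of the cgd filesystem
  dd if=/dev/rraid1d bs=65536 count=4096 | md5 &
  fsck -n /dev/rcgd0a &
  wait

  # several read-only fscks at once; compare what they report
  fsck -n /dev/rcgd0a > fsck1.out 2>&1 &
  fsck -n /dev/rcgd0a > fsck2.out 2>&1 &
  wait; diff fsck1.out fsck2.out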

If this produces the problem, it's a great result, because combined
with your previous test it clearly isolates seeking and thus almost
certainly power as the problem.  You've done the test that eliminated
the seeks; now you need to add the seeks back and eliminate cgd.  After
that, you might try the same test on all of the individual drives in
parallel, to eliminate the raid(4) software, if you really want to
prove the point.
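
For that last variation, something like this (assuming, purely for the
sake of example, that the raid components are wd0 and wd1; substitute
your real drives):

  # read every raid member directly and concurrently, bypassing raid(4)
  for d in wd0 wd1; do
          dd if=/dev/r${d}d bs=65536 count=4096 | md5 &
  done
  wait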

If it doesn't produce the problem, I don't immediately see any other
culprits consistent with the data so far, and I might start getting a
little freaked out too... :-)

--
Dan.