current-users: Re: Data corruption issues, probably involving ffs2 and >1Tb

Subject: Re: Data corruption issues, probably involving ffs2 and >1Tb
To: Nino Dehne <ndehne@gmail.com>
From: Daniel Carosone <dan@geek.com.au>
List: current-users
Date: 01/22/2007 08:45:19

/*
 * tech-kern@ added and subject changed, in the hopes of recruiting
 * some ffs-expert help. After some comprehensive testing and
 * elimination this is looking very much like a ffsv2 bug to me.
 *
 * Quick background, more details available in the current-users
 * archives:
 *  - ~1.1Tb ffs2 on cgd on raidframe R5 on 5x wd(4) NetBSD 3.x
 *  - reproducible occasional data corruption reading files
 *  - not reproducible reading wd, raid, or cgd devices
 *  - hardware, memory, power, etc pretty well eliminated
 */

On Thu, Jan 18, 2007 at 09:21:25AM +0100, Nino Dehne wrote:
> On Wed, Jan 17, 2007 at 11:58:56PM +0100, Nino Dehne wrote:
> > On Thu, Jan 18, 2007 at 07:31:47AM +1100, Daniel Carosone wrote:
> > > Nino, are you running a kernel with DIAGNOSTIC and/or DEBUG?  Looking
> > > at the cgd panic you found, I'm guessing not, because the path we see
> > > to that problem would have involved one or more likely DIAGNOSTIC
> > > messages.
> >
> > Not yet, but that just went on my list of things to try.
>
> I'm now running the system with those options. I didn't try to provoke
> the cgd panic yet, though. Parity recalculation is a lengthy process.

Sure.  While a test run of that when you get a chance could be helpful
to confirm the specific diagnosis of that problem, it's a separate
issue from your data corruption.

> > 1) Boot DIAGNOSTIC+DEBUG kernel
> > 2) Run fsck -f[1]
>
> I ran fsck -fn 10 times in a row, with 4 gzips running concurrently.
> Nothing. Output looked like this each time:
>
> ** /dev/rcgd0a (NO WRITE)
> ** File system is already clean
> ** Last Mounted on /home
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 678270 files, 138286777 used, 8460366 free (8334 frags, 1056504 blocks, 0.0% fragmentation)

The only real explanation is that live access via the filesystem code
is causing the problem; every other path at every other layer below
that has been unable to provoke the issue, and at least fsck isn't
recognising any on-disk corruption.

> Then I ran cmp -l /var/tmp/<good file> /var/tmp/<bad file>:
>
> 503124993 246 310
> 503124994 132 251
> 503124995 230 221
> 503124996 211 351
> 503124997  51  46
> 503124998 214 173
> 503124999 374 122
> 503125000 144 331
> 503125001 134 141
> 503125002 150 336
> 503125003  46 247
> 503125004 266 153
> 503125217 257 211
> 503125218 303 217
> 503125219 111  14
> 503125220  70 227
> 503125221   2 316
> 503125222 343 340
> 503125223 207 372
> 503125224 350 210
> 503125229 100  67
> 503125230  64 145
> 503125231 262 327
> 503125232 205 146
>
> Another run of the script got me another sample. cmp -l:
>
> 502883433 167 363
> 502883434 141 126
> 502883435  26  11
> 502883436 311  67
> 502883437  25 153
> 502883438 302 103
> 502883439 145  40
> 502883440 103  71
> 502883445 346 174
> 502883446  45  60
> 502883447 333 262
>
> I managed to get both samples with under 20 runs of the script.

It's a small sample, but the coincidence of the offsets and the short
range over which the corruption occurs is rather interesting.  Especially
if this pattern is repeated, and for different large source files, it
continues to strongly suggest filesystem issues to me.
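
To make the pattern easier to compare across samples, here is a minimal
sketch (Python, and not the script used in the tests above) that parses
`cmp -l` output and groups the differing bytes into contiguous runs.  The
function names are made up for illustration; the data is the first sample
quoted above.  Note that `cmp -l` prints 1-based decimal offsets followed
by the two differing byte values in octal.

```python
SAMPLE_1 = """\
503124993 246 310
503124994 132 251
503124995 230 221
503124996 211 351
503124997  51  46
503124998 214 173
503124999 374 122
503125000 144 331
503125001 134 141
503125002 150 336
503125003  46 247
503125004 266 153
503125217 257 211
503125218 303 217
503125219 111  14
503125220  70 227
503125221   2 316
503125222 343 340
503125223 207 372
503125224 350 210
503125229 100  67
503125230  64 145
503125231 262 327
503125232 205 146
"""


def parse_cmp_l(text):
    """Return (offset, byte_a, byte_b) tuples from `cmp -l` output.

    Offsets are 1-based decimal; byte values are octal.  Leading mail
    quote markers ("> ") are tolerated, and other lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.lstrip("> ").split()
        if len(parts) != 3:
            continue
        try:
            rows.append((int(parts[0]), int(parts[1], 8), int(parts[2], 8)))
        except ValueError:
            continue  # not a cmp -l data line
    return rows


def group_runs(offsets):
    """Group sorted offsets into (start_offset, length) contiguous runs."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][1] + 1:
            runs[-1][1] = off          # extend the current run
        else:
            runs.append([off, off])    # start a new run
    return [(start, end - start + 1) for start, end in runs]


for start, length in group_runs([r[0] for r in parse_cmp_l(SAMPLE_1)]):
    print(f"run at 1-based offset {start}, {length} bytes")
```

For this sample the differences fall into three runs of 12, 8 and 4
bytes.  Feeding further samples through the same grouping would show
whether the run lengths and offsets repeat.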

Can you try to provoke the problem with a file a little smaller than
this?  Perhaps also with files much larger, to see if there's ever
corruption further along than this. My off-hand guess is that this is
near a boundary where the next level of indirection blocks kicks in.
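
As a rough way to check that guess, the sketch below computes the file
offsets at which each level of indirect blocks starts to be used.  It
assumes the usual FFS inode layout of 12 direct block pointers and 3
indirect levels, with the 8-byte block pointers of ffsv2.  The block size
of this filesystem is not stated in the thread, so the sizes looped over
are only candidates, and the names are made up for illustration.

```python
NDADDR = 12    # direct block pointers in the inode
NIADDR = 3     # single, double and triple indirect pointers
PTR_SIZE = 8   # bytes per block pointer in ffsv2 (64-bit)


def indirection_boundaries(bsize):
    """Return [(level, first_byte_offset)] for each indirect level.

    first_byte_offset is the 0-based file offset of the first byte that
    is mapped through that level of indirect blocks.
    """
    nindir = bsize // PTR_SIZE      # pointers held by one indirect block
    boundaries = []
    offset = NDADDR * bsize         # bytes covered by the direct blocks
    for level in range(1, NIADDR + 1):
        boundaries.append((level, offset))
        offset += (nindir ** level) * bsize
    return boundaries


# Candidate block sizes only; substitute the real one from dumpfs.
for bsize in (8192, 16384, 32768, 65536):
    for level, first in indirection_boundaries(bsize):
        print(f"bsize={bsize:6d}  level {level} indirect starts at byte {first}")
```

Comparing the corruption offsets (minus one, since cmp counts from 1)
against these boundaries for the real block size would show how close
they actually are.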

The next thing to try, if you can, is a -current (or 4-beta) kernel, to
see whether some filesystem fix made since 3.x already addresses this.

Is there an ffs doctor in the house?  This is reminding me more and
more of problems der Mouse reported seeing some time ago.

> As a wild guess, I resolved all IRQ conflicts on the machine.
> [..]
> Both steps helped nothing to resolve the issue.

These were unlikely at this point, but thanks for going to the effort
of eliminating them.

--
Dan.