current-users: odd memory corruption problems

Subject: odd memory corruption problems
To: None <current-users@netbsd.org>
From: Greg Troxel <gdt@ir.bbn.com>
List: current-users
Date: 08/02/2006 14:55:15


I'm having a problem that I think is probably hardware, but I'm
posting anyway because there's some chance it is related to the
suspected pool corruption problems.

The machine is a P4-3400 with Intel 915 motherboard and 2 GB RAM.  I
have two Seagate 400 GB drives RAID-1 with raidframe.  I used to use
softdeps but turned them off when I started having trouble.

It is running a fairly recent -current, now 3.99.19 from 5/7, and
before that something also pretty recent.  The machine was new around
March.

I am storing about 30 GB of digital photos.  I typically mount /cf
with msdosfs, cp files to /home (which is on raid0), possibly renaming
them to get the 10000 digit, and then rsync to another machine.

The raid usually shuts down cleanly, but once the machine was powered
off uncleanly and fsck -p failed on reboot.  There were a lot of dup
blocks, and I lost several 100 MB files (that I was able to recover
from backups).  I think this was latent fs damage perhaps due to the
same corruption issue happening in an inode.

I noticed some corrupt pictures.  In the cases I could trace, the
cause was a few wrong bytes in the file; I'm quite confident the same
is true of the other ones.

I immediately suspected memory, and ran memtest+ 1.65 overnight for 10
hours, and it found zero errors.

I made a list of all .jpg and .nef files under ~/PICTURES, and ran
xargs md5 on that list multiple times.  The output differed from run
to run.  Two files were often different, and then there were larger
differences.
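
The repeated-checksum test can be sketched in Python as follows.  This
is a minimal illustration, not the commands actually used; the names
`digest` and `unstable_files` and the default of three passes are
assumptions.  A file whose digest changes between passes was read
inconsistently, provided nothing wrote to it in between.

```python
import hashlib

def digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the file at path, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def unstable_files(paths, passes=3, hasher=digest):
    """Hash every file `passes` times; return the files whose digest varied."""
    seen = {}
    for _ in range(passes):
        for p in paths:
            seen.setdefault(p, set()).add(hasher(p))
    return sorted(p for p, digests in seen.items() if len(digests) > 1)
```

With far more file data than RAM, each pass presumably re-reads most
files from disk rather than from the buffer cache, so this exercises
the whole read path and not only memory.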

I mounted (ro) the underlying RAID-1 components, and ran xargs md5 on
those.  Two files differed between wd0 and wd1, and in both cases
wd0 was right (judged by inspecting the pictures to find the undamaged
one).
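
A hedged sketch of the component comparison: walk one read-only mount
and compare each file byte for byte with the file at the same relative
path on the other.  `same_bytes` and `differing_files` are illustrative
names, and the mount points in the usage line are hypothetical.

```python
import os

def same_bytes(a, b, chunk_size=1 << 20):
    """Return True if files a and b have identical contents."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            block_a, block_b = fa.read(chunk_size), fb.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files hit EOF together
                return True

def differing_files(root_a, root_b):
    """Return relative paths under root_a that are missing from root_b
    or whose contents differ from the copy under root_b."""
    bad = []
    for dirpath, _dirs, files in os.walk(root_a):
        for name in files:
            a = os.path.join(dirpath, name)
            rel = os.path.relpath(a, root_a)
            b = os.path.join(root_b, rel)
            if not os.path.isfile(b) or not same_bytes(a, b):
                bad.append(rel)
    return sorted(bad)

# Hypothetical usage: differing_files("/mnt/wd0", "/mnt/wd1")
```

This only says that the two copies disagree; deciding which copy is
correct still needs an outside check such as viewing the picture or
comparing against a backup.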

In one case, here's the difference (~ 6 MB file).  This pattern is
typical of data errors.  This typically causes the picture to have a
color/lightness shift starting at some x coordinate and continuing on.

> cmp -l 29596-0.nef 29596-1.nef
2478609 143 343
2479121 137 337
2479249  22 222
2479633  61 261
2479761 163 363
2480657 173 373
2480785 147 347
2481169 127 327
2481425  44 244
2482065  50 250
2482097 102 302
> l 29596-0.nef 29596-1.nef
-r--r--r--  1 gdt  wheel  5897374 Jun 29 12:19 29596-0.nef
-r--r--r--  1 gdt  wheel  5897374 Jun 29 12:19 29596-1.nef

Other problems I have found are often similar, with the high bit set
in the bad file, and errors typically in a 4K chunk:

> dc
10k
2482097 4096/p
605.9807128906
2478609 4096/p
605.1291503906
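
Both observations (high bit set, everything inside one 4K chunk) can be
checked mechanically from the cmp -l output above.  The following is a
small sketch; `analyze` is an illustrative name.  It relies on cmp -l
printing a 1-based decimal byte number followed by the two differing
byte values in octal.

```python
PAGE = 4096

def analyze(cmp_output):
    """Parse `cmp -l` lines and return (XOR masks, 4 KB block indices)."""
    masks, blocks = set(), set()
    for line in cmp_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unrelated lines
        offset = int(fields[0])
        good, bad = int(fields[1], 8), int(fields[2], 8)
        masks.add(good ^ bad)             # which bits were flipped
        blocks.add((offset - 1) // PAGE)  # cmp byte numbers start at 1
    return masks, blocks

sample = """\
2478609 143 343
2479121 137 337
2479249  22 222
2479633  61 261
2479761 163 363
2480657 173 373
2480785 147 347
2481169 127 327
2481425  44 244
2482065  50 250
2482097 102 302
"""

masks, blocks = analyze(sample)
print([hex(m) for m in sorted(masks)], sorted(blocks))
# prints: ['0x80'] [605]
```

Every differing byte has exactly bit 0x80 flipped, and all eleven
offsets fall in 4096-byte block 605 of the file, matching the dc
results.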

One xargs md5 run on wd0 found 10 files different.  I'm aware of cache
consistency issues (write to raid doesn't update buffer cache for wd0
fs), but these had not been written since boot.

Right now I've pulled one of my 1 GB DIMMs, have recovered the two
files that were different in the RAID set, and am redoing md5.  So far
so good.

If that's ok, I'll try the other DIMM, and then both.  I realize this
could be memory or the power supply, or the processor/mobo.

But, since memtest+ says it is ok, and the machine seems stable except
for this, I wonder if something is scribbling on the buffer cache
occasionally.  It could just be that very rarely the memory reads back
the wrong bits, and it's only here that I noticed.

Any clues welcome, and I'll post again if I can identify bad hardware.

-- 
    Greg Troxel <gdt@ir.bbn.com>
