Subject: Re: LFS and big files
To: Mihai Chelaru <kefren@netbsd.ro>
From: Greg Troxel <gdt@ir.bbn.com>
List: current-users
Date: 01/03/2007 19:01:26
I won't dispute the 'test your memory' advice from others at all, but
memory tests aren't fully adequate to find problems.  I've seen
problems similar to yours, and after removing one DIMM I no longer
have them.  But I'm not convinced I understand what's going on.

My machine is a 3.4 GHz P4, and it had two 1 GB DIMMs.  It has two
Seagate 400 GB SATA drives in RAID-1 with raidframe.  It always ran
fine, and it survived memtest (forget which version) for an entire
day.

But, I found that some image files (usually JPG, from a digital
camera) were corrupt.  On comparing against a separate copy from the
memory card, and from another machine, I found that some bits were
different, usually confined to a single 4K page, but occasionally in
more than one page.  I further found that the two RAID-1 copies
differed, sometimes with one of them being correct.  This can lead to
md5/sha1 returning a different value every time the blocks leave the
cache, since RAID-1 reads can be satisfied from either disk.  (I have overlay
filesystems that I mount read-only to be able to debug such things.)
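
The kind of comparison described above can be sketched roughly as
follows.  This is only an illustration, not the procedure actually
used: the function name differing_pages and the fixed 4096-byte page
size are assumptions, and it expects two regular files (a suspect copy
and a known-good copy).  It reports which 4K pages differ and by how
many bytes.

```python
import sys

PAGE = 4096  # assumed page size, matching the 4K pages mentioned above


def differing_pages(path_a, path_b, page=PAGE):
    """Return {page_index: differing_byte_count} for two regular files."""
    result = {}
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(page)
            b = fb.read(page)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                # Bytes that differ, plus any length mismatch at the tail.
                diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
                result[index] = diff
            index += 1
    return result


if __name__ == "__main__":
    if len(sys.argv) == 3:
        pages = differing_pages(sys.argv[1], sys.argv[2])
        for idx, count in sorted(pages.items()):
            print(f"page {idx} (offset {idx * PAGE}): {count} bytes differ")
```

Run it as "python3 pagediff.py suspect.jpg good.jpg" (the script name
is arbitrary).  Damage confined to one or two page indexes is
consistent with page-sized corruption.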

My current working theory is that the memory I pulled is indeed bad,
but that it takes the noise induced by heavy disk activity to provoke
it.  So maybe the power supply is marginal, and the memory is less
robust.  Or perhaps there's a raid bug with lots of memory, but
there's no evidence to point to that.

Given that you are testing with a larger-than-memory file, md5 will
reread from disk each time.  If it's different every time, instead of
having a dominant value, you have serious trouble.  Next time you get
a bad distfile, mv it aside, and once a fresh copy checks OK, run
'cmp -l' on the two.  I suspect you'll find that it's off by a bad
page or so, and stably off.
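
A rough sketch of the "dominant value" check, in Python.  The names
digest and tally, the default of five runs, and the 1 MB read size are
all assumptions made for illustration.  It is only meaningful when the
reads really come from disk each time, e.g. with a file larger than
memory as above; otherwise the cache will hand back the same blocks.

```python
import hashlib
import sys
from collections import Counter


def digest(path, algo="md5", chunk=1 << 20):
    """Hash a file in chunks so a larger-than-memory file is fine."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def tally(path, runs=5, algo="md5"):
    """Hash the same file several times and count each distinct digest."""
    return Counter(digest(path, algo) for _ in range(runs))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        for value, count in tally(sys.argv[1]).most_common():
            print(f"{count:3d}  {value}")
```

One digest seen on every run means the data on disk is stable (even
if wrong).  One dominant digest with a few strays, or a different
digest on every run, points at corruption happening on the read path.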

If memtest86+ says things are ok, I'd take out half the memory and
then the other half, and see if the system behaves any better.  Please
report back if you figure anything out.

-- 
    Greg Troxel <gdt@ir.bbn.com>