Subject: Re: LFS and big files
To: Greg Troxel , Mihai Chelaru <kefren@netbsd.ro>
From: Gary Thorpe <gathorpe79@yahoo.com>
List: current-users
Date: 01/03/2007 22:25:06
--- Greg Troxel <gdt@ir.bbn.com> wrote:

> I won't dispute the 'test your memory' advice from others at all, but
> memory tests aren't fully adequate to find problems.  I've seen
> problems that are similar to yours and after removing 1 DIMM don't
> have them.  But I'm not convinced I understand what's going on.
> 
> My machine is a 3.4 GHz P4, and it had two 1 GB DIMMs.  It has two
> Seagate 400 GB SATA drives in RAID-1 with raidframe.  It always ran
> fine, and it survived memtest (forget which version) for an entire
> day.
> 
> But, I found that some image files (usually JPG, from a digital
> camera) were corrupt.  On comparing against a separate copy from the
> memory card, and from another machine, I found that some bits were
> different, usually confined to a 4K page, but occasionally in more
> than one page.  I further found that the two RAID-1 copies differed,
> sometimes with one of them being correct.  This can lead to md5/sha1
> returning a different value every time the blocks leave the cache,
> since raid-1 reads can get filled from either disk.  (I have overlay
> filesystems that I mount read-only to be able to debug such things.)
> 
> My current working theory is that the memory I pulled is indeed bad,
> but that it takes the noise induced by heavy disk activity to provoke
> it.  So maybe the power supply is marginal, and the memory is less
> robust.  Or perhaps there's a raid bug with lots of memory, but
> there's no evidence to point to that.
> 
> Given that you are testing with a larger-than-memory file, md5 will
> reread from disk each time.  If it's different always, instead of
> having a dominant value, you have serious trouble.  Next time you get
> a bad distfile, mv it aside and then when it checks do 'cmp -l'.  I
> suspect you'll find that it's somewhat off with a bad page, and stably
> off.
> 
> If memtest86+ says things are ok, I'd take out half the memory and
> then the other half, and see if the system behaves any better.  Please
> report back if you figure anything out.
> 
> -- 
>     Greg Troxel <gdt@ir.bbn.com>

Are you using ECC? Although consumer machines now ship with 1GB+, isn't
ECC really a must-have for that much memory, since random bit errors
become much more likely as capacity grows? Would it make a difference
at all in this case?
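
For anyone wanting to automate the two checks Greg describes above
(the 'cmp -l' comparison against a known-good copy, and repeated md5
runs to see whether one value dominates), here is a rough sketch in
Python. The function names, the 4096-byte page size, and the number of
runs are my own assumptions for illustration, not anything taken from
Greg's setup. Note that rereading only exercises the disk path if the
file is larger than memory or has otherwise left the cache.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page size, matching the 4K pages mentioned above


def differing_pages(path_a, path_b, page_size=PAGE_SIZE):
    """Compare two files page by page.

    Returns a list of (page_index, differing_byte_count) tuples,
    one entry for every page whose contents differ.
    """
    result = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(page_size)
            b = fb.read(page_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                # Count mismatched byte positions, plus any length difference
                # if one file ends inside this page.
                diff = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
                result.append((index, diff))
            index += 1
    return result


def repeated_digests(path, runs=5, chunk_size=1 << 20):
    """Hash the same file several times and tally the md5 values seen.

    A single dominant value suggests stable corruption (or none);
    a different value on every run suggests reads are unreliable.
    """
    seen = Counter()
    for _ in range(runs):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        seen[h.hexdigest()] += 1
    return seen
```

Something like differing_pages("distfile.bad", "distfile.good") would
then list which pages are off and by how many bytes, and
repeated_digests("distfile.bad") shows whether the checksum is stable
across rereads. Both file names here are placeholders.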
