, Mihai Chelaru <kefren@netbsd.ro>
From: Gary Thorpe <gathorpe79@yahoo.com>
List: current-users
Date: 01/03/2007 22:25:06
--- Greg Troxel <gdt@ir.bbn.com> wrote:
> I won't dispute the 'test your memory' advice from others at all, but
> memory tests aren't fully adequate to find problems. I've seen
> problems that are similar to yours and after removing 1 DIMM don't
> have them. But I'm not convinced I understand what's going on.
>
> My machine is a 3.4 GHz P4, and it had 2 1GB dimms. It has two
> Seagate 400 GB SATA drives in RAID-1 with raidframe. It always ran
> fine, and it survived memtest (forget which version) for an entire
> day.
>
> But, I found that some image files (usually JPG, from a digital
> camera) were corrupt. On comparing against a separate copy from the
> memory card, and from another machine, I found that some bits were
> different, usually confined to a single 4K page, but occasionally in more
> than one page. I further found that the two RAID-1 copies differed,
> sometimes with one of them being correct. This can lead to md5/sha1
> returning a different value every time the blocks leave the cache,
> since raid-1 reads can get filled from either disk. (I have overlay
> filesystems that I mount read-only to be able to debug such things.)
>
> My current working theory is that the memory I pulled is indeed bad,
> but that it takes the noise induced by heavy disk activity to provoke
> it. So maybe the power supply is marginal, and the memory is less
> robust. Or perhaps there's a raid bug with lots of memory, but
> there's no evidence to point to that.
>
> Given that you are testing with a larger-than-memory file, md5 will
> reread from disk each time. If it's different always, instead of
> having a dominant value, you have serious trouble. Next time you get
> a bad distfile, mv it aside and then when it checks do 'cmp -l'. I
> suspect you'll find that it's somewhat off with a bad page, and stably
> off.
>
> If memtest86+ says things are ok, I'd take out half the memory and
> then the other half, and see if the system behaves any better. Please
> report back if you figure anything out.
>
> --
> Greg Troxel <gdt@ir.bbn.com>
Are you using ECC? Although consumer machines now have 1GB+, isn't ECC
really a must-have for that much memory, since random bit errors become
much more likely? Would it make a difference at all?
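
For the 'cmp -l' check suggested in the quoted message, here is a rough
sketch of how the differing bytes could be grouped by 4K page, to see
whether the damage stays within one page. The function name diff_pages
and the 4096-byte page size are my own assumptions for illustration, not
something from the original report.

```python
import sys
from collections import Counter

PAGE_SIZE = 4096  # assumed page size, matching the 4K pages described above


def diff_pages(path_a, path_b, page_size=PAGE_SIZE):
    """Return a Counter mapping page index -> number of differing bytes.

    Hypothetical helper: reads both files one page at a time and counts
    mismatched bytes per page. If one file is shorter, the missing bytes
    are counted as differences.
    """
    pages = Counter()
    page = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(page_size)
            b = fb.read(page_size)
            if not a and not b:
                break
            if a != b:
                mismatched = sum(x != y for x, y in zip(a, b))
                mismatched += abs(len(a) - len(b))
                pages[page] += mismatched
            page += 1
    return pages


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python diffpages.py good-copy suspect-copy
    for page, count in sorted(diff_pages(sys.argv[1], sys.argv[2]).items()):
        print(f"page {page} (offset {page * PAGE_SIZE}): {count} differing bytes")
```

Running it against the same pair of files several times (after the blocks
have left the cache) would show whether the bad page is stable or moves
around.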