Subject: Re: cpu memory read ECC error?
To: Albrecht Gebhardt <albrecht.gebhardt@uni-klu.ac.at>
From: Chris Tribo <ctribo@del.net>
List: port-pmax
Date: 06/10/2001 18:22:31
on 6/10/01 2:24 PM, Albrecht Gebhardt at albrecht.gebhardt@uni-klu.ac.at
wrote:

> So what's this? I guess a hardware error? Should I remove some parts from
> the memory? Is there some diagnostic tool to find the malicious chip
> (from the PROM console)?
> The error messages after logging in contained something like "module
> 2" ...

    It probably means the ECC and the actual contents of RAM are disagreeing
on two or more bits. A single-bit error will be corrected, but it'll still
write an error to the console. Double-bit or burst ECC errors will panic the
system. This can also happen if you have mixed-size memory modules in your
system. The one exception is the very last slot (15?), which can hold an 8MB
module if you have 32MB modules in the rest of the slots. At the PROM
console, >> cnfg 3 will report the size of the RAM. If you have a PrestoServ
module installed, it must be in slot 15. Use >> t 3/ram to run the RAM test.
I think there is also a -l option to run it continuously.
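
    To make the single-bit vs. double-bit distinction concrete, here is a toy
sketch in Python of a SECDED (single error correct, double error detect)
Hamming code over 4 data bits. This is only an illustration of the general
idea; it is not the actual ECC scheme or word width the DECstation memory
controller uses, and the names encode() and check() are made up for the
example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED word (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8
    # Data bits go in the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = d
    # Hamming parity bits at positions 1, 2, 4.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Position 0 holds an overall parity bit covering positions 1..7.
    bits[0] = sum(bits[1:]) % 2
    return bits


def check(bits):
    """Return (status, nibble); nibble is None if the word is uncorrectable."""
    # The syndrome is the XOR of the positions of all set bits in 1..7.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    overall = sum(bits) % 2  # 0 for a valid word

    fixed = list(bits)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1          # single-bit error: fixed, but still reported

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1         # double-bit error: detected, cannot be fixed

print(check(word))
print(check(one_flip))
print(check(two_flips))
```

    The "corrected" case corresponds to the error that only gets logged to
the console, and the "uncorrectable" case to the one that panics the system.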
    The RAM test will report the number(s) of the failing module(s). Then
you can either remove them and fill the slots with other modules, or try
cleaning the connectors with the procedure mentioned here and see if you get
lucky:
http://mail-index.netbsd.org/port-pmax/2000/07/24/0005.html
    If you still can't get the self test to pass, you probably have a truly
bad RAM chip and not just poor connection(s). I suppose it would be possible
to figure out which chip is bad with the information from the PROM RAM
test, but you'd be hard pressed to unsolder an individual chip and then
re-solder it on one of these modules, with ~90 chips per board. Good luck on
that.

> There was no such error during the (full and lengthy) installation from
> CD! 

    The install RAMdisk uses only about the first 3 to 6MB of RAM, so chances
are it never touched the bad location.
 
> The harddisk is a Seagate 2.4GB (very old, could also be the source of the
> error?) 

    Nope, that would show up as a SCSI error, not an ECC error. And a 2.4GB
drive isn't old compared to the ancient Conner RZ22 I have (44MB).


    Chris