Subject: Re: Memory Fault Reported By Kernel
To: Nick Boyce <nick@glimmer.demon.co.uk>
From: Uwe Lienig <Uwe.Lienig@fif.mw.htw-dresden.de>
List: port-pmax
Date: 09/22/2000 15:36:23
This is a multi-part message in MIME format.
--------------5773D6CDD8F09AE782673FE6
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Nick Boyce wrote:
> 
> I've just converted a DEC 5240 from Ultrix 4.4 to NetBSD 1.4.2, and in
> the first 24 hours we've had one occurrence of the following report in
> /var/log/messages :
> 
>  Sep 21 10:20:01 rccnx4 /netbsd: CPU memory read ECC error at
> 0x00270824
>  Sep 21 10:20:01 rccnx4 /netbsd:    ECC 0xd39cdd0c
> 
> The machine seems to be functioning happily despite this, and I assume
> that "ECC" means this was, um, a single-bit fail in error correcting
> memory and was therefore recovered - guessing madly here.  I also
> guess this sort of hardware-related behaviour is peculiar to the pmax
> kernel so I shouldn't ask about this anywhere but on this list ... is
> that about right ?

ECC stands for Error Correcting Code. AFAIK it can correct 1-bit
errors and detect 2-bit errors.
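
To illustrate what "correct 1-bit, detect 2-bit" means, here is a
minimal sketch in Python of a toy extended Hamming(8,4) code. This is
only an illustration of the principle. It is not the code the DECstation
memory controller actually uses, and the function names encode() and
decode() are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    The codeword is a list of bits. Index 0 holds an overall parity
    bit; indices 1..7 form a classic Hamming(7,4) codeword with
    check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]   # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]   # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]   # covers positions 4, 5, 6, 7
    c[0] = sum(c[1:]) % 2       # makes the parity of all 8 bits even
    return c


def decode(codeword):
    """Return (status, nibble).

    status is "ok", "corrected" (one bit was flipped and repaired)
    or "uncorrectable" (two bits were flipped; nibble is None).
    """
    c = list(codeword)
    # XOR of the positions of all set bits; 0 for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    overall = sum(c) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: exactly one bit flipped. The syndrome
        # is its position (0 means the overall parity bit itself).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity but non-zero syndrome: two bits flipped.
        return "uncorrectable", None
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nibble
```

Flipping any one bit of encode(n) gives ("corrected", n) from decode();
flipping any two bits gives ("uncorrectable", None). Real memory ECC
works on wider words, but the idea is the same.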

> 
> I never noticed Ultrix reporting things like this, but maybe NetBSD is
> better with the hardware :-)

Ultrix reports this as well, but to see it you have to use the Ultrix
tool uerf (AFAIK the ULTRIX error report formatter). This tool reads
the binary error log file under
/var/adm/syserr/syserr.<systemname>

I ran a 5000/200 for 10 years, and I still have two 200s and two 240s
which I plan to run under NetBSD. I saw this error for some time, and
it kept the machine from doing useful work. Investigating it, I found
that the very large memory modules are soldered into a connector. The
connector has tiny pins which connect to the PCB of the memory module.
The solder joints were broken, so I resoldered the pins, and the
errors went away.

After that I saw this type of message only very rarely. Cleaning the
connector may help ...

> 
> Can anyone tell me whether I'd be best advised to have our hardware
> support people swap the RAM out in this machine ?

If you have enough ... IMO that's not necessary. My 133 sometimes
panicked with a parity check error. I cleaned the memory modules and
the 133 ran for another 2 years ...

One of the 200s reported a corrected single-bit error once a month,
but I never had any problem with it ...
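
If you want to keep an eye on how often this happens, something like
the following Python sketch could tally the reports by address. It
assumes the kernel message looks like the one quoted above and sits on
a single line in /var/log/messages (the line break in the quote is
probably just mail wrapping). The name count_ecc_addresses() is made up
for this example.

```python
import re
from collections import Counter

# Matches e.g. "... CPU memory read ECC error at 0x00270824".
# The pattern is based only on the single report quoted in this thread.
ECC_LINE = re.compile(r"ECC error at\s+(0x[0-9a-fA-F]+)")


def count_ecc_addresses(lines):
    """Count ECC error reports per reported address.

    `lines` is any iterable of log lines, for example an open file.
    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Example use (the path is the usual NetBSD location, adjust as needed):
#
#   with open("/var/log/messages", errors="replace") as log:
#       for addr, n in count_ecc_addresses(log).most_common():
#           print(addr, n)
```

If the same address keeps showing up, that at least tells you the
errors are not scattered at random, which may help when you talk to
your hardware people.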

> 
> If so, do the above reported addresses help me figure out which memory
> module is bad ?  (The machine has 32 Mb)
> 
> Cheers,
> 
> Nick Boyce
> Bristol, UK
> --
> If you try to fail, and succeed, which have you done?
--------------5773D6CDD8F09AE782673FE6
Content-Type: text/x-vcard; charset=us-ascii;
 name="Uwe.Lienig.vcf"
Content-Transfer-Encoding: 7bit
Content-Description: Card for Uwe Lienig
Content-Disposition: attachment;
 filename="Uwe.Lienig.vcf"

begin:vcard 
n:Lienig;Uwe
tel;fax:(+49 351) 462 3476
tel;work:(+49 351) 462 2780
x-mozilla-html:FALSE
org:Forschungsinstitut Fahrzeugtechnik -FiF-;Computer Aided Design and Advanced Simulation Technology
adr:;;Friedrich-List-Platz 1;Dresden;Saxony;01069;Germany
version:2.1
email;internet:Uwe.Lienig@fif.mw.htw-dresden.de
note;quoted-printable:... have you ever seen MicroSoft users smiling ...=0D=0A=0D=0ASysAdmin 
x-mozilla-cpt:;-8576
fn:Uwe Lienig
end:vcard

--------------5773D6CDD8F09AE782673FE6--