Subject: MCHK exception in -current with MMU off
To: None <port-powerpc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-powerpc
Date: 03/24/2005 14:43:50
Timo Schoeler and I have been tracking down the cause of a repeatable MCHK
exception in -current. We have ruled out memory failure by testing each
memory stick individually and in different locations; the panic is
unaffected. I mention this because one possible cause of a MCHK exception
is a bus parity error and/or memory failure. Because the same G4 will boot
2.99.9, we do not believe this is a hardware failure.

Please refer to
http://mail-index.netbsd.org/port-macppc/2005/03/18/0004.html for some
additional information; I will only hit the highlights here.

We have isolated two of the three variations on the panic to the following:

(Altivec-enabled)
trap: pid 1.1 (init): kernel MCHK trap @ 0x5769a4 (SRR1=0x2041020)
5769a0:	38 00 00 00 	li	r0,0
5769a4:	7e 69 01 ce 	stvx	v19,r9,r0

r0 must be 0; from show registers in ddb:

r9          0x810000    opcodes_base+0x3c


5769a4 is vzeropage+0x88 and appears to be the first pass of zeroing the
page with a single cache line:

__asm("stvx %2,%0,%1" ::  "b"(pa), "r"( 0), "n"(ZERO_VEC));

(r9 is incremented with subsequent passes, so I infer this is the first pass)



(non-Altivec)
trap: pid 1.1 (init): kernel MCHK trap @ 0x55fc48 (SRR1=0x49000)

55fbec:	48 00 02 81 	bl      55fe6c <curcpu>
55fbf0:	7c 69 1b 78 	mr      r9,r3
55fbf4:	80 09 01 10 	lwz     r0,272(r9)
55fbf8:	90 1f 00 10 	stw     r0,16(r31)

<snip>

55fc44:	80 1f 00 10 	lwz     r0,16(r31)
55fc48:	7c 09 00 f8 	not     r9,r0

55fc48 is pmap_syncicache+0x78 and appears to be:

len += pa - (pa & ~linewidth);

from earlier in the code:
const size_t linewidth = curcpu()->ci_ci.icache_line_size;

so r0 is linewidth. The mnemonic "not" expands to nor rA,rS,rS so here
again I do not see r0 as the culprit.
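
The register assignments in both disassembly lines can be double-checked by
splitting the instruction words into their X-form fields. The following is a
minimal sketch, not kernel code; the struct and function names are
hypothetical. It assumes the usual 32-bit PowerPC layout (primary opcode in
the top 6 bits, then three 5-bit register fields, then a 10-bit extended
opcode), where extended opcode 231 is stvx and 124 is nor.

```c
#include <stdint.h>

/* Fields of a PowerPC X-form instruction word (illustrative names). */
struct xform {
    unsigned opcd;  /* primary opcode, top 6 bits */
    unsigned rs;    /* rS (or vS for stvx) */
    unsigned ra;    /* rA */
    unsigned rb;    /* rB */
    unsigned xo;    /* 10-bit extended opcode */
};

/* Split a 32-bit instruction word into its X-form fields. */
static struct xform decode_xform(uint32_t insn)
{
    struct xform f;

    f.opcd = (insn >> 26) & 0x3f;
    f.rs   = (insn >> 21) & 0x1f;
    f.ra   = (insn >> 16) & 0x1f;
    f.rb   = (insn >> 11) & 0x1f;
    f.xo   = (insn >> 1)  & 0x3ff;
    return f;
}
```

Applied to 0x7e6901ce this gives opcode 31, vS=19, rA=9, rB=0, XO=231
(stvx v19,r9,r0), and applied to 0x7c0900f8 it gives opcode 31, rS=0, rA=9,
rB=0, XO=124 (nor r9,r0,r0, i.e. not r9,r0), matching the objdump output.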

From the show registers:

r9          0x820000    linux_sysent+0x5cc
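
The SRR1 values printed with the two traps carry saved MSR bits, so the
translation state at the time of each exception can be read from them. This
is a rough sketch using the standard 32-bit PowerPC MSR bit positions; the
helper names are hypothetical. Note the assumption: on a machine check some
SRR1 bits carry implementation-specific cause information instead of saved
MSR state, and this sketch does not try to interpret those.

```c
#include <stdint.h>

/* Standard 32-bit PowerPC MSR bit masks. */
#define MSR_DR 0x00000010u  /* data address translation */
#define MSR_IR 0x00000020u  /* instruction address translation */
#define MSR_ME 0x00001000u  /* machine check enable */
#define MSR_EE 0x00008000u  /* external interrupt enable */

/* Nonzero if data translation was enabled in the saved MSR bits. */
static int data_translation_on(uint32_t srr1)
{
    return (srr1 & MSR_DR) != 0;
}

/* Nonzero if instruction translation was enabled in the saved MSR bits. */
static int insn_translation_on(uint32_t srr1)
{
    return (srr1 & MSR_IR) != 0;
}
```

For the two values above, 0x2041020 has IR set and DR clear, and 0x49000 has
both IR and DR clear.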


In both cases, the exception is occurring with MSR_DR (MMU) turned off, and
in both cases r9 holds the value pa. I believe the MCHK exception is
misleading, though, because 0x8xxxxx is in the memory the kernel file
physically occupies (objdump -h from the kernel and the last panic):

netbsd:     file format elf32-powerpc

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00602bdc  00100000  00100000  00000060  2**4
                  CONTENTS, ALLOC, LOAD, CODE
  1 .rodata       0010fc44  00702be0  00702be0  00602c40  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 link_set_domains 00000018  00812824  00812824  00712884  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 link_set_pools 00000158  0081283c  0081283c  0071289c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  4 link_set_sysctl_funcs 000000e0  00812994  00812994  007129f4  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  5 link_set_malloc_types 00000158  00812a74  00812a74  00712ad4  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  6 link_set_dkwedge_methods 00000004  00812bcc  00812bcc  00712c2c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 link_set_bufq_strats 0000000c  00812bd0  00812bd0  00712c30  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  8 link_set_evcnts 00000030  00812bdc  00812bdc  00712c3c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  9 .sdata2       00000000  00812c0c  00812c0c  00712c6c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 10 .data         000129b0  00812c10  00812c10  00712c70  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 11 .sdata        00000fd4  008255c0  008255c0  00725620  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 12 .sbss         00000ac8  00826598  00826598  007265f8  2**3
                  ALLOC
 13 .bss          0003aff4  00827060  00827060  007265f8  2**3
                  ALLOC
 14 .comment      000082c8  00000000  00000000  007265f8  2**0
                  CONTENTS, READONLY
 15 .ident        0000ae48  00000000  00000000  0072e8c0  2**0
                  CONTENTS, READONLY
 16 .note         00000028  00000000  00000000  00739708  2**2
                  CONTENTS, READONLY

As far as I can see, the .data section contains 0x820000, so it is a valid
physical address, which eliminates another possible cause of MCHK
exceptions. Originally I thought that the cache might be attempting to
write to an invalid physical address with the MMU off: the Altivec panic is
on a write to the cache, which may require something to be written back to
memory to make room, and this can cause a deferred MCHK exception. However,
I see no specific reason the not instruction would also hit the cache, as it
appears to be a purely register-based operation. Unless the instructions
themselves are getting stored in the cache, I don't see why anything would
be moved out of the cache by a not (and I don't know why an instruction
cache would write to memory).
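
The address-to-section check can be done mechanically from the objdump
listing. This is a small sketch with hypothetical names; the LMA and size
values are copied from the table above, and the small link_set_* sections
and the non-loaded sections are left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* One loaded section: name, load address, and size in bytes. */
struct section {
    const char *name;
    uint32_t lma;
    uint32_t size;
};

/* Values taken from the objdump -h output (larger sections only). */
static const struct section sections[] = {
    { ".text",   0x00100000u, 0x00602bdcu },
    { ".rodata", 0x00702be0u, 0x0010fc44u },
    { ".data",   0x00812c10u, 0x000129b0u },
    { ".sdata",  0x008255c0u, 0x00000fd4u },
    { ".sbss",   0x00826598u, 0x00000ac8u },
    { ".bss",    0x00827060u, 0x0003aff4u },
};

/* Return the name of the section containing pa, or NULL if none does. */
static const char *section_for(uint32_t pa)
{
    size_t i;

    for (i = 0; i < sizeof sections / sizeof sections[0]; i++) {
        if (pa >= sections[i].lma && pa - sections[i].lma < sections[i].size)
            return sections[i].name;
    }
    return NULL;
}
```

With these values, 0x820000 (r9 in the non-Altivec panic) falls in .data,
and 0x810000 (r9 in the Altivec panic) falls in .rodata.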

In both cases, the address being referred to is within the kernel and looks
like an aligned allocation of memory. The tr output from the panics shows
this is occurring during a fork, so it seems to me that something is amiss with
either the kernel or user VM, but I am not familiar with the specifics of
the implementation.

If anyone has any suggestions as to how to isolate this any further, please
let me know. If I have misinterpreted something, please let me know as well
so I can correct my misunderstanding.

The condition is reproducible, so testing will quickly show whether it is
fixed; however, due to a variety of constraints, time between tests is lengthy.

thanks in advance,
tim