Subject: MCHK exception in -current with MMU off
To: None <port-powerpc@netbsd.org>
From: Tim Kelly <hockey@dialectronics.com>
List: port-powerpc
Date: 03/24/2005 14:43:50
Timo Schoeler and I have been tracking down the cause of a repeatable MCHK
exception in -current. We have eliminated memory failure as the cause by
testing each memory stick individually and in different locations; none of
this affected the panic. I state this because one possible cause of an
MCHK exception is a bus parity error and/or memory failure. Because the
same G4 will boot 2.99.9, we do not believe this is a hardware failure.
Please refer to
http://mail-index.netbsd.org/port-macppc/2005/03/18/0004.html for some
additional information; I will only hit the highlights here.
We have isolated two of the three variations on the panic to the following:
(Altivec-enabled)
trap: pid 1.1 (init): kernel MCHK trap @ 0x5769a4 (SRR1=0x2041020)
5769a0: 38 00 00 00 li r0,0
5769a4: 7e 69 01 ce stvx v19,r9,r0
r0 must be 0; from show registers in ddb:
r9 0x810000 opcodes_base+0x3c
5769a4 is vzeropage+0x88 and appears to be the first pass of zeroing the
page with a single cache line:
__asm("stvx %2,%0,%1" :: "b"(pa), "r"( 0), "n"(ZERO_VEC));
(r9 is incremented with subsequent passes, so I infer this is the first pass)
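As a cross-check on the disassembly, the instruction word can be decoded by
hand. The following is a minimal sketch in C, not taken from the kernel
sources; the struct and function names are hypothetical. It splits a 32-bit
X-form word into its fields, so that 7e 69 01 ce can be confirmed as primary
opcode 31, extended opcode 231 (stvx), with vS=19, rA=9 and rB=0.

```c
#include <stdint.h>

/* Fields of a PowerPC X-form instruction word (hypothetical helper type). */
struct xform {
	unsigned opcode;	/* primary opcode, top 6 bits */
	unsigned rs;		/* source register (vS for stvx) */
	unsigned ra;		/* first operand register */
	unsigned rb;		/* second operand register */
	unsigned xo;		/* 10-bit extended opcode */
};

/* Split a 32-bit instruction word into its X-form fields. */
static struct xform
decode_xform(uint32_t insn)
{
	struct xform f;

	f.opcode = (insn >> 26) & 0x3f;
	f.rs = (insn >> 21) & 0x1f;
	f.ra = (insn >> 16) & 0x1f;
	f.rb = (insn >> 11) & 0x1f;
	f.xo = (insn >> 1) & 0x3ff;
	return f;
}
```

decode_xform(0x7e6901ce) gives rs=19, ra=9, rb=0, which matches
stvx v19,r9,r0: the effective address is r9 plus the contents of r0. The
same routine can be applied to any other X-form word from the disassembly.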
(non-Altivec)
trap: pid 1.1 (init): kernel MCHK trap @ 0x55fc48 (SRR1=0x49000)
55fbec: 48 00 02 81 bl 55fe6c <curcpu>
55fbf0: 7c 69 1b 78 mr r9,r3
55fbf4: 80 09 01 10 lwz r0,272(r9)
55fbf8: 90 1f 00 10 stw r0,16(r31)
<snip>
55fc44: 80 1f 00 10 lwz r0,16(r31)
55fc48: 7c 09 00 f8 not r9,r0
55fc48 is pmap_syncicache+0x78 and appears to be:
len += pa - (pa & ~linewidth);
from earlier in the code:
const size_t linewidth = curcpu()->ci_ci.icache_line_size;
so r0 is linewidth. The mnemonic "not" expands to nor rA,rS,rS so here
again I do not see r0 as the culprit.
From the show registers:
r9 0x820000 linux_sysent+0x5cc
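The SRR1 values in the two trap lines carry the translation bits of the MSR
at the time of the exception, so they can be checked directly. This is a
minimal sketch, assuming the standard 32-bit PowerPC MSR bit positions for
MSR_IR and MSR_DR; the helper function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IR	0x00000020u	/* instruction address translation enabled */
#define MSR_DR	0x00000010u	/* data address translation enabled */

/* True if the saved MSR image in SRR1 had data translation on. */
static bool
srr1_data_translation_on(uint32_t srr1)
{
	return (srr1 & MSR_DR) != 0;
}

/* True if the saved MSR image in SRR1 had instruction translation on. */
static bool
srr1_insn_translation_on(uint32_t srr1)
{
	return (srr1 & MSR_IR) != 0;
}
```

For 0x2041020 this reports instruction translation on and data translation
off; for 0x49000 both are off. The remaining bits are not interpreted here,
since the machine-check status bits in SRR1 are implementation-specific.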
In both cases, the exception is occurring with MSR_DR (MMU) turned off, and
in both cases r9 holds the value pa. I believe the MCHK exception is
misleading, though, because 0x8xxxxx is in the memory the kernel file
physically occupies (objdump -h from the kernel and the last panic):
netbsd: file format elf32-powerpc
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00602bdc 00100000 00100000 00000060 2**4
CONTENTS, ALLOC, LOAD, CODE
1 .rodata 0010fc44 00702be0 00702be0 00602c40 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 link_set_domains 00000018 00812824 00812824 00712884 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 link_set_pools 00000158 0081283c 0081283c 0071289c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 link_set_sysctl_funcs 000000e0 00812994 00812994 007129f4 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 link_set_malloc_types 00000158 00812a74 00812a74 00712ad4 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
6 link_set_dkwedge_methods 00000004 00812bcc 00812bcc 00712c2c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
7 link_set_bufq_strats 0000000c 00812bd0 00812bd0 00712c30 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 link_set_evcnts 00000030 00812bdc 00812bdc 00712c3c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
9 .sdata2 00000000 00812c0c 00812c0c 00712c6c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
10 .data 000129b0 00812c10 00812c10 00712c70 2**3
CONTENTS, ALLOC, LOAD, DATA
11 .sdata 00000fd4 008255c0 008255c0 00725620 2**3
CONTENTS, ALLOC, LOAD, DATA
12 .sbss 00000ac8 00826598 00826598 007265f8 2**3
ALLOC
13 .bss 0003aff4 00827060 00827060 007265f8 2**3
ALLOC
14 .comment 000082c8 00000000 00000000 007265f8 2**0
CONTENTS, READONLY
15 .ident 0000ae48 00000000 00000000 0072e8c0 2**0
CONTENTS, READONLY
16 .note 00000028 00000000 00000000 00739708 2**2
CONTENTS, READONLY
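To see which section each r9 value lands in, the allocated sections from
the listing can be put in a table and searched. This is a minimal sketch,
assuming the physical address equals the LMA shown by objdump (the kernel
is running with data translation off at its load address); the table and
function names are hypothetical, and the sizes and addresses are copied from
the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* One allocated section from the objdump -h listing. */
struct section {
	int idx;		/* Idx column */
	const char *name;	/* Name column */
	uint32_t size;		/* Size column */
	uint32_t lma;		/* LMA column */
};

static const struct section sections[] = {
	{  0, ".text",                    0x00602bdc, 0x00100000 },
	{  1, ".rodata",                  0x0010fc44, 0x00702be0 },
	{  2, "link_set_domains",         0x00000018, 0x00812824 },
	{  3, "link_set_pools",           0x00000158, 0x0081283c },
	{  4, "link_set_sysctl_funcs",    0x000000e0, 0x00812994 },
	{  5, "link_set_malloc_types",    0x00000158, 0x00812a74 },
	{  6, "link_set_dkwedge_methods", 0x00000004, 0x00812bcc },
	{  7, "link_set_bufq_strats",     0x0000000c, 0x00812bd0 },
	{  8, "link_set_evcnts",          0x00000030, 0x00812bdc },
	{  9, ".sdata2",                  0x00000000, 0x00812c0c },
	{ 10, ".data",                    0x000129b0, 0x00812c10 },
	{ 11, ".sdata",                   0x00000fd4, 0x008255c0 },
	{ 12, ".sbss",                    0x00000ac8, 0x00826598 },
	{ 13, ".bss",                     0x0003aff4, 0x00827060 },
};

/* Return the Idx of the section containing pa, or -1 if none does. */
static int
section_index(uint32_t pa)
{
	size_t i;

	for (i = 0; i < sizeof(sections) / sizeof(sections[0]); i++) {
		if (pa >= sections[i].lma &&
		    pa - sections[i].lma < sections[i].size)
			return sections[i].idx;
	}
	return -1;
}
```

With this table, 0x810000 falls in .rodata (index 1) and 0x820000 falls in
.data (index 10).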
As far as I can see, the .data section occupies 0x820000 and therefore this
is a valid physical address, which eliminates an additional possible cause
of MCHK exceptions. Originally I thought that perhaps the cache was
attempting to write to a non-valid physical address with the MMU off: the
Altivec panic is on a write to the cache, which may require something to be
written back to memory to make room, and this can cause a deferred MCHK
exception. However, I see no specific reason the not instruction would also
hit the cache, as it appears to be a purely register-based operation. Unless
the instructions themselves are getting stored in the cache, I don't see why
anything would be moved out of the cache by a not (and I don't know why an
instruction cache would write to memory).
In both cases, the address being referred to is within the kernel and looks
like an aligned allocation of memory. The tr output from the panics shows
this is occurring during a fork, so it seems to me that something is amiss with
either the kernel or user VM, but I am not familiar with the specifics of
the implementation.
If anyone has any suggestions as to how to isolate this any further, please
let me know. If I have misinterpreted something, please let me know as well
so I can correct my misunderstanding.
The condition is reproducible, so testing will quickly show whether it is
fixed; however, due to a variety of constraints, the time between tests is
lengthy.
thanks in advance,
tim