Subject: Re: new snapshot available
To: None <eeh@netbsd.org, grant@grunta.com>
From: None <eeh@netbsd.org>
List: port-sparc64
Date: 08/28/2001 01:47:39
|
| On Mon, Aug 27, 2001 at 03:04:33PM -0000, eeh@netbsd.org wrote:
|
| > | output is below, prtconf -v is attached.
| > 
| > Actually, `prtconf -pv' is preferred since it prints out just the
| > firmware information rather than the driver information.
|
| Attached. :)
|
| > | elf64_exec: Booting /pci@1f,4000/network@1,1/netbsd.INSTALL
| > | 4560808@0x1000000+7316704@0x1800000Fast Data Access MMU Miss
| > 
| > This is interesting.  I wonder if you have issues w/the layout of memory
| > on that machine.
|
| Actually I did. The 256mb as shipped was in bank 0, and an additional
| 256mb in bank 1. This was causing Solaris to panic:
|
| WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x00000018.a2198f2f
|     AFSR 0x00000001<ME>.80200000<PRIV,UE> AFAR 0x00000000.afe0acc0
|     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1000c4cc
|     UDBH 0x020b<UE> UDBH.ESYND 0x0b UDBL 0x020b<UE> UDBL.ESYND 0x0b
|     UDBH Syndrome 0xb Memory Module U1001 U1002 U1003 U1004
| WARNING: [AFT1] Uncorrectable Memory Error on CPU2 Data access at TL=0, errID 0x00000018.a2198f2f
|     AFSR 0x00000001<ME>.80200000<PRIV,UE> AFAR 0x00000000.afe0acc0
|     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1000c4cc
|     UDBH 0x020b<UE> UDBH.ESYND 0x0b UDBL 0x020b<UE> UDBL.ESYND 0x0b
|     UDBL Syndrome 0xb Memory Module U1001 U1002 U1003 U1004
| panic[cpu2]/thread=70109e20: [AFT1] errID 0x00000018.a2198f2f UE Error(s)
|
| Moving the memory out of bank 0 eliminated the problem (U1001,2,3,4).
| The additional memory is now in bank 1, original in bank 2, and Solaris is
| fine. Hopefully I will be able to get the original memory replaced and
| put it back into bank 0.
|
| What I'm unclear about now is why 1.5.1's ofwboot.net got further
| booting the INSTALL kernel than mrg's snapshot:
|
| Rebooting with command: boot net netbsd.INSTALL                       
| Boot device: /pci@1f,4000/network@1,1  File and args: netbsd.INSTALL
| 15400 >> NetBSD/sparc64 OpenFirmware Boot, Revision 
| >> (martin@setting-sun.duskware.de, Thu Jun 28 20:24:14 CEST 2001)
| Using BOOTPARAMS protocol: ip address: 192.168.211.212, hostname: grb-test2
| root addr=192.168.211.70 path=/data2/NetBSD/sparc64/snapshot/20010821/root
| loadfile: reading header
| elf64_exec: Booting /pci@1f,4000/network@1,1/netbsd.INSTALL
| 4560808@0x1000000+7316704@0x1800000+504824@0x1efa4e0 
| symbols @ 0xfff4a280 74 start=0x1000000
| chain: calling OF_chain(800000, ef18, 1000000, fff83a80, 18)
| [ netbsd ELF symbol table not valid ]
| [ no symbol table formats found ]    
| Kernel size exceeds 4MB          
| Setting DTLB entry 00000000 01000000 data e0000000 af800074
| Setting DTLB entry 00000000 01800000 data e0000000 ae800076
| Setting ITLB entry 00000000 01000000 data e0000000 af800074
| Setting CPUINFO mappings...
| Setting TSB pointer 00000000 01f86000
| consinit()
| setting up stdin
| stdin instance = fff714e8
| setting up stdout
| stdout instance = fff71c48
| stdout package = f0061998
|
| SIR Reset
|
| Watchdog Reset
| Externally Initiated Reset
| {2} ok .trap-registers
| %TL:1 %TT:3 %TPC:f0051d88 %TnPC:f0051d8c 
| %TSTATE:15001600  %CWP:0 
|    %PSTATE:16 AG:0 IE:1 PRIV:1 AM:0 PEF:1 RED:0 MM:0 TLE:0 CLE:0 MG:0 IG:0 
|    %ASI:15  %CCR:0  XCC:nzvc   ICC:nzvc

Hm.  The software-initiated reset (the `SIR Reset' above) is at %TPC
f0051d88, which is inside the PROM.  Gads.  I hate dealing w/buggy PROMs.

O.K.  I don't know how involved you want to get in debugging this
but...

You can use the `<n> .window' command to display individual register
windows.  Start with `0 .window' and go up until it starts to complain.

Look at the contents of the i0 or o0 register in each window.  One of those
may be the parameter passed to the firmware.  That register will point to a
block of memory that contains a variable-length array of 64-bit parameters.

Kernel text addresses should be in the range 0x01000000 to 0x01800000
(the upper end only for very large kernels), and kernel data is usually
from 0x01800000 to 0x01d00000.  PROM addresses are usually from
0xf0000000 to 0xf1000000.
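
If it helps while reading through the register windows, here is a rough
sketch of the ranges above as a lookup.  The function name and the
half-open treatment of each range are my own assumptions, and the ranges
themselves are only the usual ones, not hard limits.

```python
# Usual address ranges from the message above (assumed half-open: lo <= addr < hi).
RANGES = (
    ("kernel text", 0x01000000, 0x01800000),
    ("kernel data", 0x01800000, 0x01d00000),
    ("PROM", 0xf0000000, 0xf1000000),
)


def classify(addr):
    """Return a rough label for an address seen in a register window."""
    for name, lo, hi in RANGES:
        if lo <= addr < hi:
            return name
    return "unknown"


# %TPC from the trap registers, then the two kernel load addresses from the boot log.
for value in (0xf0051d88, 0x01000000, 0x01800000):
    print(f"{value:#010x}: {classify(value)}")
```

An "unknown" result does not mean the pointer is bad, only that it falls
outside the usual ranges.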

If you see a pointer that looks like it's in the data segment, and that
location is still mapped, you can dump the contents with `<addr> <len> dump'.
Look at the results.  The first parameter should be a pointer to a string,
which you can dump the same way.  

The rest of the arguments vary according to the command.  You can simply look
through src/sys/arch/sparc64/sparc64/openfirm.c and
src/sys/arch/sparc64/sparc64/ofw_machdep.c
to see what each command is and what its parameters are.
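
To make the dump easier to read, here is a minimal sketch that splits such
a block into its 64-bit cells.  It assumes the usual OpenFirmware client
interface layout (pointer to the command name, then the number of
arguments, then the number of return values, then the argument and return
cells); only the first cell being a string pointer is stated above, so
treat the rest as an assumption.  The function name and the sample bytes
are made up for illustration.

```python
import struct


def decode_call(raw):
    """Decode a dumped argument array into its fields (assumed layout)."""
    count = len(raw) // 8
    if count < 3:
        raise ValueError("need at least 3 cells: name pointer, nargs, nrets")
    # SPARC is big-endian, so each 64-bit cell is unpacked with ">Q".
    cells = struct.unpack(f">{count}Q", raw[:count * 8])
    name_ptr, nargs, nrets = cells[:3]
    return {
        "name_ptr": name_ptr,  # dump this address to read the command name
        "args": list(cells[3:3 + nargs]),
        "rets": list(cells[3 + nargs:3 + nargs + nrets]),
    }


# Made-up bytes standing in for what `<addr> <len> dump' would show.
sample = struct.pack(">6Q", 0x01a12340, 2, 1, 0xfff71c48, 0x10, 0)
call = decode_call(sample)
print(f"command name string at {call['name_ptr']:#x}")
print("args:", [hex(a) for a in call["args"]])
print("rets:", [hex(r) for r in call["rets"]])
```

If the argument or return counts come out absurdly large, the pointer
probably was not an argument array after all.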

Once we know what's dying we can try to figure out why.

Eduardo