Subject: Re: arm32 kernel crashes
To: None <port-arm32@netbsd.org>
From: David Forbes <dmf20@hermes.cam.ac.uk>
List: port-arm32
Date: 12/15/1998 14:26:17
I've now rebuilt a kernel on my CATS box with Charles Hannum's modified
debugger, and things ran smoothly until activity on the serial port caused
the crash again.  I've noticed that the point at which the crash occurs
varies (in number of characters exchanged on tty01), but the point in the
code is always the same.

Fault with intr_depth > 0
Data abort: 'Translation fault (page)' status=007 address=effffffc
							PC=f0116a6c
Stopped in bash at irq_entry+0x88: ldr r2, [r7, r9, lsl#2]

In this particular case, a login was achieved and bash started.  But not
for long.

db> tr
_comstart
_ttstart
_ttwrite
_comwrite
_spec_write
_ufsspec_write
_vn_write
_dofilewrite
_sys_write
_syscall

This is as before.  (I've omitted the (_symbol +0x10) offsets because they
were all the same.)

db> show registers
spsr 	0x40000093
r0 	0
r1	_intr_disabled_mask
r2	0xe28ff441
r3	0x80000013
r4	0x1
r5	0xf1152000
r6	0xf114cb00
r7	0xf0180248  (_spl_masks)
r8	0
r9	0xfff9ff6d
r10	0xf4000000
r11	0xf37b9d5c
r12	0x1
usr_sp	0xefbfd394
usr_lr	0x200f1eb4
svc_sp	0xf37b9cdc
svc_lr	_splx + 0x30
pc	irq_entry + 0x88
und_sp	0xf37b8ff0
abt_sp	0xf01bc000
irq_sp	0xf01bb000
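
For reference, the spsr value in the dump can be decoded by hand.  The
following is a minimal sketch, assuming the standard 32-bit ARM PSR layout
(condition flags in bits 31-28, I and F in bits 7 and 6, mode in bits 4-0)
and assuming that the value shown is the processor state saved at the
abort.  The helper name decode_psr is made up for illustration.

```python
# Mode field values for 32-bit ARM program status registers.
MODES = {
    0x10: "usr", 0x11: "fiq", 0x12: "irq", 0x13: "svc",
    0x17: "abt", 0x1B: "und", 0x1F: "sys",
}

def decode_psr(psr):
    """Split a 32-bit ARM PSR value into its flag, mask and mode fields."""
    return {
        "N": bool(psr & (1 << 31)),
        "Z": bool(psr & (1 << 30)),
        "C": bool(psr & (1 << 29)),
        "V": bool(psr & (1 << 28)),
        "irq_disabled": bool(psr & (1 << 7)),
        "fiq_disabled": bool(psr & (1 << 6)),
        "mode": MODES.get(psr & 0x1F, "unknown"),
    }

spsr = 0x40000093  # value from the register dump
print(decode_psr(spsr))
```

With the value above this gives Z set, IRQs disabled, FIQs enabled and
SVC mode.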

Looking at these values, I'm not surprised that ldr r2, [r7, r9, lsl#2]
failed.  Anyway, attempting to continue just repeats the original Data
abort error as many times as you like.
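
The register values can be checked against the reported fault address.
This is a small sketch of the address arithmetic for
ldr r2, [r7, r9, lsl #2], assuming plain 32-bit wraparound; the helper
names are made up for illustration and only the r7 and r9 values come
from the dump.

```python
MASK32 = 0xFFFFFFFF

def ldr_address(base, index, shift):
    """Effective address of `ldr rd, [base, index, lsl #shift]` in 32 bits."""
    return (base + ((index << shift) & MASK32)) & MASK32

def as_signed32(value):
    """Interpret a 32-bit register value as a two's-complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value

r7 = 0xF0180248  # _spl_masks, from the register dump
r9 = 0xFFF9FF6D  # from the register dump

print(hex(ldr_address(r7, r9, 2)))  # 0xeffffffc
print(as_signed32(r9))              # -393363
```

The computed address matches the address=effffffc in the abort message,
and r9 read as a signed value is a large negative index, i.e. the load is
far below the start of _spl_masks.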

However, attempting to reboot:

db> reboot
boot: howto = 00000000 curproc = 0xf3787600
Warning IRQs disabled during boot()
syncing disks...22 21 10 done
Fault with intr_depth > 0
Data abort: 'Translation fault (page)' status = 007 address = 2004b330
		PC=f0110df0
Stopped in update at _fetchuserword+0x30: ldr r0,[r0, #0x0000]

I'm assuming that this is related to the previous fault, so I haven't
noted the register values, etc.  Issuing another reboot does so instantly,
and the machine comes back up with wd0a not marked clean, but wd0e is.



The code in irq_entry that causes the original fault is in
footbridge/footbridge_irq.S, in the section concerned with finding the
highest IPL.

	mov 	r9, #(_SPL_LEVELS - 1)
	ldr 	r7, Lspl_masks

Lfind_highest_ipl:
	ldr	r2, [r7, r9, lsl #2]			/* Fault here */
	tst	r8, r2
	subeq	r9, r9, #1
	beq	Lfind_highest_ipl

Now, according to the register dump, r8 is zero.  If I read this
correctly, TST r8, r2 will then always set the Z flag, so the EQ condition
will always be true and we keep subtracting 1 from r9 until the load
faults.  According to the code, r8 should hold the current IRQ requests.
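
The runaway loop can be modelled outside the kernel.  The sketch below is
an illustration only: the number of levels (8), the contents of the mask
table and the lowest mapped address (0xf0000000, guessed from the fault
address) are all assumptions, and memory outside the table is modelled as
zero, which does not matter when no requests are pending.

```python
MASK32 = 0xFFFFFFFF

def find_highest_ipl(pending, spl_masks, base, lowest_mapped):
    """Model of the Lfind_highest_ipl loop.

    Returns ("found", index) when a mask matches the pending requests,
    or ("fault", address, index) when the load leaves mapped memory.
    """
    levels = len(spl_masks)
    index = levels - 1                      # mov r9, #(_SPL_LEVELS - 1)
    while True:
        address = (base + 4 * index) & MASK32
        if address < lowest_mapped:         # ldr would take a data abort
            return ("fault", address, index)
        # ldr r2, [r7, r9, lsl #2]; reads outside the table modelled as 0
        mask = spl_masks[index] if 0 <= index < levels else 0
        if pending & mask:                  # tst r8, r2 -> Z clear, exit
            return ("found", index)
        index -= 1                          # subeq r9, r9, #1; beq loop

# Hypothetical table: one bit per level (not the real _spl_masks contents).
masks = [1 << i for i in range(8)]

print(find_highest_ipl(0x4, masks, 0xF0180248, 0xF0000000))
print(find_highest_ipl(0x0, masks, 0xF0180248, 0xF0000000))
```

With a pending request the model stops inside the table.  With r8 = 0 it
never matches and only stops at address 0xeffffffc with an index of
-393363, which is 0xfff9ff6d as a 32-bit value, the same r9 as in the
register dump.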

I'm presuming that, because everything else appears to function normally
until an error occurs, this code is perhaps not directly to blame?

Cheers,

David.

PS - I've rudely assumed in the above that db accounts for the pipeline
and for the fault being given by an instruction further back in the code
than the one I've looked at.  In retrospect, this seems a rather dim
assumption...