Subject: What is wrong in -current?
To: None <port-sparc64@NetBSD.org>
From: Martin Husemann <martin@duskware.de>
List: port-sparc64
Date: 05/19/2007 13:02:13
Hi folks,

here are some details of what goes wrong in -current. This could be a bug
introduced with the idle-lwp changes, but the code looks correct; it could
also be an old bug uncovered by the changes. Any suggestions about the
cause are greatly appreciated ;-)

Here is an attempt to boot -current with some instrumentation added:

...
wskbd0 at kbd0 mux 1            
pmap_deactivate(0x184d160)
cpu_switchto() switching from curlwp 0x184d160 to 0xdac3b80 (saving to 0x184d160), sp 0xda3d621, pc 0x100aa3c
new curpcb: 0xda3a000        
pmap_activate(0xdac3b80)
pmap_activate: this is a kernel pmap, no need to activate
pmap_deactivate(0xdac3b80)                               
cpu_switchto() switching from curlwp 0xdac3b80 to 0x184d160 (saving to 0xdac3b80), sp 0xe003ec21, pc 0x11458ec
new curpcb: 0x1c02000         
cpu_switchto() returned prevlwp 0xdac3b80
pmap_activate(0x184d160)                 
pmap_activate: this is a kernel pmap, no need to activate
pmap_deactivate(0x184d160)                               
cpu_switchto() switching from curlwp 0x184d160 to 0xdac3b80 (saving to 0x184d160), sp 0xda3d491, pc 0x11458ec
new curpcb: 0xda3a000        
cpu_switchto() returned prevlwp 0x184d160
pmap_activate(0xdac3b80)                 
pmap_activate: this is a kernel pmap, no need to activate
data_access_fault: type 0x30                             
Trapframe 0xda3ded0:    tstate: 0       pc: 0   npc: 112de40
fault: 0x0      y: 0    pil: 0  oldpil: 0       tt: 30  Globals:
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
outs:                                                              
0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
SIG_DFL: send signal 11 to lwp 0xdac3b80 proc 0x184ce60            
cpu0: kdb breakpoint at 120a7c0                        
Stopped in pid 0.2 (system) at  netbsd:cpu_Debugger+0x4:        nop
db> bt                                                             
lwp_userret(dac3b80, 0, 184ce60, 184d000, da3ddc8, 1848fe0) at netbsd:lwp_userret+0x144
data_access_fault(da3ded0, 30, 0, 0, 1815000, 110819) at netbsd:data_access_fault+0x7ec
?(0, 0, 0, 0, 0, 0) at 0x1008bb4
db> x/i 0x112de40               
netbsd:idle_loop:       save            %sp, -0xc0, %sp

This all happens when the kernel sleeps for the first time (keyboard
controller waiting for reset). We see switches between lwp0 (0x184d160) and
the idle lwp (0xdac3b80). Everything looks fine, then suddenly we end up in a
data fault with a completely bogus trapframe, and an apparently bogus trap pc
and fault address (as if the trap register stack had no data for us at all).
Because of that bogus trapframe, the kernel then tries to return to
userland(!).
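
To illustrate why an all-zero trapframe would lead to the userland return
path, here is a minimal sketch. It assumes the fault handler tells kernel
traps from user traps by the PRIV bit saved in tstate. In the SPARC V9
layout, TSTATE holds a copy of PSTATE starting at bit 8, and PSTATE.PRIV is
bit 2 of PSTATE. All sketch_/SKETCH_ names are made up for this example and
are not the kernel's own definitions; the last helper is only a suggestion
for further instrumentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SPARC V9: TSTATE carries PSTATE from bit 8 up; PSTATE.PRIV is bit 2. */
#define SKETCH_PSTATE_PRIV          UINT64_C(0x004)
#define SKETCH_TSTATE_PSTATE_SHIFT  8
#define SKETCH_TSTATE_PRIV \
	(SKETCH_PSTATE_PRIV << SKETCH_TSTATE_PSTATE_SHIFT)

/* Hypothetical, cut-down trapframe: only fields shown in the dump. */
struct sketch_trapframe {
	uint64_t tstate;
	uint64_t pc;
	uint64_t npc;
};

/* True if the saved trap state says the trap came from privileged mode. */
static bool
sketch_trap_from_kernel(const struct sketch_trapframe *tf)
{
	return (tf->tstate & SKETCH_TSTATE_PRIV) != 0;
}

/*
 * Hypothetical extra instrumentation: a system lwp has no user context,
 * so a trap taken while it runs should always look privileged.
 */
static void
sketch_assert_kernel_trap(const struct sketch_trapframe *tf)
{
	assert(sketch_trap_from_kernel(tf));
}
```

With tstate 0, as in the dump above, sketch_trap_from_kernel() returns
false, so under this assumption the fault would be handled as a user-mode
trap, which would match ending up in lwp_userret() and the SIGSEGV being
posted to a system lwp.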

Depending on instrumentation details we have seen variants of this, but all
ended up trying to return to userland within a system lwp.

Martin