NetBSD-Bugs archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

kern/42539: softint may fire on wrong cpu



>Number:         42539
>Category:       kern
>Synopsis:       softint may fire on wrong cpu
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Dec 29 09:40:00 +0000 2009
>Originator:     Martin Husemann
>Release:        NetBSD 5.99.22
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 5.99.22 NetBSD 5.99.22 (NELLY.MP) #96: Mon Dec 
28 22:26:09 CET 2009 
martin%emmas.aprisoft.de@localhost:/nelly/usr/src/sys/arch/sparc64/compile/NELLY.MP
 sparc64
Architecture: sparc64
Machine: sparc64
>Description:

I started seeing a few KASSERTs some time ago, from softint_execute():

        KASSERT(si->si_active);

The problem is pretty reliably reproducable for me, so I tried to find out
a bit more details. It turned out that the softint in question always was
"ser/1", the serial softint for cpu1. Due to the way interrupts are currently
handled in sparc64, there will never be a hard serial interrupt dispatched
to cpu1 (since cpu0 establishes the interrupt handler and we do not move
them). Now that explains why the softint is not marked as active, but how
could it fire?

I added some instrumentation and caught it like this:

panic: ci = 0x1814000, l = 0x1317c7c0, l->l_cpu = 0xd728000, l->l_ctxswtch = 1  

Stopped in pid 473.1 (xulrunner-bin) at netbsd:cpu_Debugger+0x4:        nop     
softint_trigger(1814000, 1317c7c0, 3c2120a, d6c6d58, 46842ba0, 800) at 
netbsd:softint_trigger+0x74
softint_schedule(978, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at 
netbsd:softint_schedule+0xdc
zshard(1, 4, ffffffffffe00000, ffffffff, 0, 0) at netbsd:zshard+0x54
intr_list_handler(0, a, e0017ed0, ffffffffb72bd290, 106f220, 4ed281e8) at 
netbsd:intr_list_handler+0x14
sparc_interrupt(1, 11eb898, 48b, 11eb938, 0, ffffffffffff6088) at 
netbsd:sparc_interrupt+0x238
lwp_setlock(1317c7c0, e5f7420, 0, ce10dc0, 0, 1317c810) at 
netbsd:lwp_setlock+0x2c
mi_switch(1317c7c0, 11ee2b8, 16d, 11eba80, 4, ffffffffffff6040) at 
netbsd:mi_switch+0x1d8
preempt(1317c7c0, 4050737c, 1322de00, 0, 0, 1411710) at netbsd:preempt+0xa0
trap(1322ded0, fffffffffffffffe, 4050737c, 99820092, 46842ba0, 800) at 
netbsd:trap+0x770
?(0, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at 0x1009070
db{0}> mach cpu                                                                 
cpu0: self 0x01814000 lwp 0x1317c7c0 pcb 0x1322a000 fplwp 0x1317c7c0            
cpu1: self 0x0d728000 lwp 0x0ce18bc0 pcb 0x0d712000 fplwp 0x00000000            
db{0}> ps                                                                       
PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT           
694      1 3   0        80            e338860             mpg123 aud_wr         
533      1 3   0        80           13252fa0             mpg123 select         
467      1 3   0        80           13253380       irc-20081115 select         
464      1 3   1        80           133bc800               rxvt select         
337      1 3   0        80           133bcbe0                ssh select         
534      1 3   0        80            e5f7040                 sh wait           
410      1 3   1        80           1317d740                ssh select         
473     35 3   1        80           133bc040      xulrunner-bin parked         
473     34 3   0        80           13252bc0      xulrunner-bin parked         
473     33 3   1        80           132527e0      xulrunner-bin parked         
473     18 3   1        80            e5f64a0      xulrunner-bin parked         
473     17 3   1        80           1317c3e0      xulrunner-bin parked         
473     13 3   0        80           133bcfc0      xulrunner-bin parked         
473  >   4 7   0         0            e5f7420      xulrunner-bin                
473      3 3   0        80           13252020      xulrunner-bin parked         
473      2 3   1        80           13252400      xulrunner-bin select         
473  >   1 7   1         0           1317c7c0      xulrunner-bin                
362      1 3   0        80            e5f7be0               qvwm select         
356      1 3   1        80           1317c000               tcsh ttyraw         
323      1 3   1        80            e5f60c0               tcsh pause          
97       1 3   0        80           1317db20             xclock select         
96       1 3   1        80           1317d360               rxvt select         
136      1 3   1        80           13253760               rxvt select         
134      1 3   1        80           13253b40              xload select         
105      1 3   1        80           1317cba0               qvwm select         
441      1 3   1        80           1317cf80               Xorg select         
461      1 3   0        80            e338c40              xinit wait           
381      1 3   1        80            e09fba0          ssh-agent select         
373      1 3   0        80            e5f6c60                 sh wait           
371      1 3   1        80            ce24040              getty ttyraw         
368      1 3   0        80            e09f7c0              getty ttyraw         
370      1 3   1        80            ce24800              getty ttyraw         
376      1 3   1        80            ce297a0              login wait           
357      1 3   0        80            e5f7800               cron nanoslp        
348      1 3   1        80            e3380a0              inetd kqueue         
300      1 3   1        80            e338480               sshd select         
287      1 3   0        80            e09f3e0             upsmon nanoslp        
303      1 3   0        80            e339bc0             upsmon piperd         
291      1 3   1        80            e3397e0               upsd select         
260      1 3   0        80            e09e080           apcsmart select         
266      1 3   1        80            e339400               ntpd pause          
247      1 3   0        80            e09ec20                lpd select         
238      1 3   0        80            e339020              mserv select         
116      1 3   1        80            e09f000            syslogd kqueue         
1        1 3   1        80            ce29b80               init wait           
0       41 3   1       200            e09e840          nfskqpoll nfskqpw        
0       40 3   1       200            e09e460            swapiod swapiod        
0       37 3   1       200            ce28060        vmem_rehash vmem_rehash    
0       36 3   0       200            ce28440           aiodoned aiodoned       
0       35 3   0       200            ce28820            ioflush syncer         
0       34 3   1       200            ce28c00           pgdaemon pgdaemon       
0       33 3   0       200            ce28fe0              nfsio nfsiod         
0       32 3   0       200            ce24be0              nfsio nfsiod         
0       31 3   0       200            ce24420              nfsio nfsiod         
0       30 3   1       200            ce24fc0              nfsio nfsiod         
0       29 3   1       200            ce293c0              unpgc unpgc          
0       20 3   0       200            ce253a0           scsibus0 sccomp         
0       19 3   1       200            ce25780            xcall/1 xcall          
0       18 1   1       200            ce25b60          softser/1                
0       17 1   1       200            ce18020          softclk/1                
0       16 1   1       200            ce18400          softbio/1                
0       15 1   1       200            ce187e0          softnet/1                
0       14 1   1       201            ce18bc0             idle/1                
0       13 3   0       200            ce18fa0         pmfsuspend pmfsuspend     
0       12 3   0       200            ce19380           pmfevent pmfevent       
0       11 3   0       200            ce19760           nfssilly nfssilly       
0       10 3   0       200            ce19b40            cachegc cachegc        
0        9 3   0       200            ce12000              vrele vrele          
0        8 3   0       200            ce123e0          modunload modunload      
0        7 3   0       200            ce127c0            xcall/0 xcall          
0        6 1   0       200            ce12ba0          softser/0                
0        5 1   0       200            ce12f80          softclk/0                
0        4 1   0       200            ce13360          softbio/0                
0        3 1   0       200            ce13740          softnet/0                
0        2 1   0       201            ce13b20             idle/0                
0        1 3   1       200            140f5e0            swapper uvm            
db{0}>                                                                          


As you can see we are interrupted (zshard, the serial hardware interrupt)
midway in mi_switch, the old lwp has already been marked l_ctxswtch.
We run softint_trigger() on cpu0 (that is curcpu(), the ci= value in the
panic message), but curlwp (the l in the panic message) is being switched 
and has l_cpu == cpu1. Since effectively the softint will fire on curlwp->l_cpu
we end up in the softint_execute for this softint on the wrong cpu.


>How-To-Repeat:

Use a multithreaded application (like firefox) with a serial mouse - at least
that pretty reliably triggers it on one of my machines (diskless with root
on NFS, audio playing, both may also be relevant).
Probably you need an arch that does not __HAVE_FAST_SOFTINTS.

>Fix:
Not sure, kill l_cpu in the migrated lwp earlier?
just use curcpu() in softint_trigger?
Or is it a MD bug and sparc64 code needs to do something I'm overlooking?



Home | Main Index | Thread Index | Old Index