NetBSD-Bugs archive
kern/42539: softint may fire on wrong cpu
>Number: 42539
>Category: kern
>Synopsis: softint may fire on wrong cpu
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Tue Dec 29 09:40:00 +0000 2009
>Originator: Martin Husemann
>Release: NetBSD 5.99.22
>Organization:
The NetBSD Foundation, Inc.
>Environment:
System: NetBSD nelly.aprisoft.de 5.99.22 NetBSD 5.99.22 (NELLY.MP) #96: Mon Dec 28 22:26:09 CET 2009
martin%emmas.aprisoft.de@localhost:/nelly/usr/src/sys/arch/sparc64/compile/NELLY.MP
sparc64
Architecture: sparc64
Machine: sparc64
>Description:
Some time ago I started seeing occasional KASSERT failures in
softint_execute():
KASSERT(si->si_active);
The problem is pretty reliably reproducible for me, so I tried to find out
a bit more. It turned out that the softint in question was always "ser/1",
the serial softint for cpu1. Due to the way interrupts are currently
handled on sparc64, a hard serial interrupt is never dispatched to cpu1
(cpu0 establishes the interrupt handler and we do not move handlers
between CPUs). That explains why the softint is not marked as active, but
how could it fire?
I added some instrumentation and caught it like this:
panic: ci = 0x1814000, l = 0x1317c7c0, l->l_cpu = 0xd728000, l->l_ctxswtch = 1
Stopped in pid 473.1 (xulrunner-bin) at netbsd:cpu_Debugger+0x4: nop
softint_trigger(1814000, 1317c7c0, 3c2120a, d6c6d58, 46842ba0, 800) at netbsd:softint_trigger+0x74
softint_schedule(978, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at netbsd:softint_schedule+0xdc
zshard(1, 4, ffffffffffe00000, ffffffff, 0, 0) at netbsd:zshard+0x54
intr_list_handler(0, a, e0017ed0, ffffffffb72bd290, 106f220, 4ed281e8) at netbsd:intr_list_handler+0x14
sparc_interrupt(1, 11eb898, 48b, 11eb938, 0, ffffffffffff6088) at netbsd:sparc_interrupt+0x238
lwp_setlock(1317c7c0, e5f7420, 0, ce10dc0, 0, 1317c810) at netbsd:lwp_setlock+0x2c
mi_switch(1317c7c0, 11ee2b8, 16d, 11eba80, 4, ffffffffffff6040) at netbsd:mi_switch+0x1d8
preempt(1317c7c0, 4050737c, 1322de00, 0, 0, 1411710) at netbsd:preempt+0xa0
trap(1322ded0, fffffffffffffffe, 4050737c, 99820092, 46842ba0, 800) at netbsd:trap+0x770
?(0, 1, 48d38548, 4684bda0, 46a34d60, 46a34d38) at 0x1009070
db{0}> mach cpu
cpu0: self 0x01814000 lwp 0x1317c7c0 pcb 0x1322a000 fplwp 0x1317c7c0
cpu1: self 0x0d728000 lwp 0x0ce18bc0 pcb 0x0d712000 fplwp 0x00000000
db{0}> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
694 1 3 0 80 e338860 mpg123 aud_wr
533 1 3 0 80 13252fa0 mpg123 select
467 1 3 0 80 13253380 irc-20081115 select
464 1 3 1 80 133bc800 rxvt select
337 1 3 0 80 133bcbe0 ssh select
534 1 3 0 80 e5f7040 sh wait
410 1 3 1 80 1317d740 ssh select
473 35 3 1 80 133bc040 xulrunner-bin parked
473 34 3 0 80 13252bc0 xulrunner-bin parked
473 33 3 1 80 132527e0 xulrunner-bin parked
473 18 3 1 80 e5f64a0 xulrunner-bin parked
473 17 3 1 80 1317c3e0 xulrunner-bin parked
473 13 3 0 80 133bcfc0 xulrunner-bin parked
473 > 4 7 0 0 e5f7420 xulrunner-bin
473 3 3 0 80 13252020 xulrunner-bin parked
473 2 3 1 80 13252400 xulrunner-bin select
473 > 1 7 1 0 1317c7c0 xulrunner-bin
362 1 3 0 80 e5f7be0 qvwm select
356 1 3 1 80 1317c000 tcsh ttyraw
323 1 3 1 80 e5f60c0 tcsh pause
97 1 3 0 80 1317db20 xclock select
96 1 3 1 80 1317d360 rxvt select
136 1 3 1 80 13253760 rxvt select
134 1 3 1 80 13253b40 xload select
105 1 3 1 80 1317cba0 qvwm select
441 1 3 1 80 1317cf80 Xorg select
461 1 3 0 80 e338c40 xinit wait
381 1 3 1 80 e09fba0 ssh-agent select
373 1 3 0 80 e5f6c60 sh wait
371 1 3 1 80 ce24040 getty ttyraw
368 1 3 0 80 e09f7c0 getty ttyraw
370 1 3 1 80 ce24800 getty ttyraw
376 1 3 1 80 ce297a0 login wait
357 1 3 0 80 e5f7800 cron nanoslp
348 1 3 1 80 e3380a0 inetd kqueue
300 1 3 1 80 e338480 sshd select
287 1 3 0 80 e09f3e0 upsmon nanoslp
303 1 3 0 80 e339bc0 upsmon piperd
291 1 3 1 80 e3397e0 upsd select
260 1 3 0 80 e09e080 apcsmart select
266 1 3 1 80 e339400 ntpd pause
247 1 3 0 80 e09ec20 lpd select
238 1 3 0 80 e339020 mserv select
116 1 3 1 80 e09f000 syslogd kqueue
1 1 3 1 80 ce29b80 init wait
0 41 3 1 200 e09e840 nfskqpoll nfskqpw
0 40 3 1 200 e09e460 swapiod swapiod
0 37 3 1 200 ce28060 vmem_rehash vmem_rehash
0 36 3 0 200 ce28440 aiodoned aiodoned
0 35 3 0 200 ce28820 ioflush syncer
0 34 3 1 200 ce28c00 pgdaemon pgdaemon
0 33 3 0 200 ce28fe0 nfsio nfsiod
0 32 3 0 200 ce24be0 nfsio nfsiod
0 31 3 0 200 ce24420 nfsio nfsiod
0 30 3 1 200 ce24fc0 nfsio nfsiod
0 29 3 1 200 ce293c0 unpgc unpgc
0 20 3 0 200 ce253a0 scsibus0 sccomp
0 19 3 1 200 ce25780 xcall/1 xcall
0 18 1 1 200 ce25b60 softser/1
0 17 1 1 200 ce18020 softclk/1
0 16 1 1 200 ce18400 softbio/1
0 15 1 1 200 ce187e0 softnet/1
0 14 1 1 201 ce18bc0 idle/1
0 13 3 0 200 ce18fa0 pmfsuspend pmfsuspend
0 12 3 0 200 ce19380 pmfevent pmfevent
0 11 3 0 200 ce19760 nfssilly nfssilly
0 10 3 0 200 ce19b40 cachegc cachegc
0 9 3 0 200 ce12000 vrele vrele
0 8 3 0 200 ce123e0 modunload modunload
0 7 3 0 200 ce127c0 xcall/0 xcall
0 6 1 0 200 ce12ba0 softser/0
0 5 1 0 200 ce12f80 softclk/0
0 4 1 0 200 ce13360 softbio/0
0 3 1 0 200 ce13740 softnet/0
0 2 1 0 201 ce13b20 idle/0
0 1 3 1 200 140f5e0 swapper uvm
db{0}>
As you can see, we are interrupted (by zshard, the serial hardware interrupt)
midway through mi_switch(); the old lwp has already been marked l_ctxswtch.
We run softint_trigger() on cpu0 (that is curcpu(), the ci = value in the
panic message), but curlwp (the l in the panic message) is being switched
and has l_cpu == cpu1. Since the softint effectively fires on curlwp->l_cpu,
we end up in softint_execute() for this softint on the wrong cpu.
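
The routing problem described above can be replayed in a small userland
model. This is only a sketch, not the kernel source: the struct layouts are
reduced to what the argument needs, and the ci_softints field and all
model_* names are hypothetical. It assumes the trigger path posts the
pending bit to curlwp->l_cpu, as the panic output suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields needed here. */
struct cpu_info {
	int      ci_index;      /* 0 for cpu0, 1 for cpu1 */
	unsigned ci_softints;   /* pending softint bits (hypothetical field) */
};

struct lwp {
	struct cpu_info *l_cpu; /* CPU the LWP is assigned to */
	int l_ctxswtch;         /* nonzero while the LWP is being switched */
};

/*
 * Model of a trigger that trusts curlwp->l_cpu. 'curlwp' is the LWP that
 * was running when the hard interrupt arrived. Returns the CPU whose
 * pending mask was modified.
 */
static struct cpu_info *
model_trigger_via_lwp(struct lwp *curlwp, unsigned bit)
{
	assert(curlwp != NULL && curlwp->l_cpu != NULL);
	curlwp->l_cpu->ci_softints |= bit;
	return curlwp->l_cpu;
}

/*
 * Replay of the state in the panic message: the interrupt is taken on
 * cpu0, but the interrupted LWP already has l_cpu == cpu1 and
 * l_ctxswtch == 1. Returns 1 if the softint was posted to a CPU other
 * than the one that took the hard interrupt.
 */
static int
model_replay_report(void)
{
	struct cpu_info cpu0 = { 0, 0 }, cpu1 = { 1, 0 };
	struct lwp l = { &cpu1, 1 };           /* mid-switch, already moved */
	struct cpu_info *interrupted = &cpu0;  /* curcpu() at zshard time */

	struct cpu_info *posted = model_trigger_via_lwp(&l, 0x1);
	return posted != interrupted
	    && cpu1.ci_softints == 0x1
	    && cpu0.ci_softints == 0;
}
```

In this model the pending bit lands on cpu1 although the hard interrupt was
handled on cpu0, which matches the ci and l->l_cpu values in the panic
message.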
>How-To-Repeat:
Use a multithreaded application (like firefox) with a serial mouse. At least
that triggers it pretty reliably on one of my machines (diskless with root
on NFS and audio playing; both may also be relevant).
You probably need an arch that does not define __HAVE_FAST_SOFTINTS.
>Fix:
Not sure. Kill l_cpu in the migrated lwp earlier?
Just use curcpu() in softint_trigger()?
Or is it an MD bug, and the sparc64 code needs to do something I'm overlooking?
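
To make the second option concrete, here is a userland sketch comparing a
trigger that posts to curlwp->l_cpu with one that posts to the executing
CPU. It is a model under stated assumptions, not a proposed patch: the
structs are reduced stand-ins, ci_softints and all model_* names are
hypothetical, and the executing CPU is passed in explicitly in place of
curcpu(). It says nothing about whether the real kernel needs further
changes, for example in how the target CPU or LWP is notified.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the kernel structures (hypothetical layout). */
struct cpu_info {
	int      ci_index;      /* 0 for cpu0, 1 for cpu1 */
	unsigned ci_softints;   /* pending softint bits (hypothetical field) */
};

struct lwp {
	struct cpu_info *l_cpu; /* CPU the LWP is assigned to */
	int l_ctxswtch;         /* nonzero while the LWP is being switched */
};

typedef void (*trigger_fn)(struct lwp *, struct cpu_info *, unsigned);

/* Variant 1: post to the CPU recorded in the interrupted LWP. */
static void
model_trigger_lwp_cpu(struct lwp *curlwp, struct cpu_info *curcpu,
    unsigned bit)
{
	(void)curcpu;           /* deliberately ignored in this variant */
	assert(curlwp != NULL && curlwp->l_cpu != NULL);
	curlwp->l_cpu->ci_softints |= bit;
}

/* Variant 2: post to the CPU that is actually executing. */
static void
model_trigger_curcpu(struct lwp *curlwp, struct cpu_info *curcpu,
    unsigned bit)
{
	(void)curlwp;           /* l_cpu may be stale during mi_switch() */
	assert(curcpu != NULL);
	curcpu->ci_softints |= bit;
}

/*
 * Set up the state from the report (interrupt taken on cpu0, interrupted
 * LWP already has l_cpu == cpu1) and return the index of the CPU that
 * received the pending bit, or -1 if none did.
 */
static int
model_posted_cpu(trigger_fn trigger)
{
	struct cpu_info cpu0 = { 0, 0 }, cpu1 = { 1, 0 };
	struct lwp l = { &cpu1, 1 };

	trigger(&l, &cpu0, 0x1);
	if (cpu0.ci_softints != 0)
		return cpu0.ci_index;
	if (cpu1.ci_softints != 0)
		return cpu1.ci_index;
	return -1;
}
```

With the state from the report, model_posted_cpu(model_trigger_lwp_cpu)
yields 1 (the wrong CPU) and model_posted_cpu(model_trigger_curcpu) yields
0 (the CPU that handled zshard).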