Subject: kern/25536: SMP kernel crash (thread related ?)
To: None <gnats-bugs@gnats.NetBSD.org>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-bugs
Date: 05/11/2004 12:45:51
>Number:         25536
>Category:       kern
>Synopsis:       SMP kernel crash (thread related ?)
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue May 11 10:46:00 UTC 2004
>Closed-Date:
>Last-Modified:
>Originator:     Manuel Bouyer
>Release:        NetBSD 2.0_BETA, 200405050000 build
>Organization:
LIP6, Universite Paris VI.
>Environment:
System: NetBSD antifer.ipv6.lip6.fr 2.0_BETA NetBSD 2.0_BETA (GENERIC.MP) #0: Sat May 8 00:33:21 UTC 2004 autobuild@tgm.netbsd.org:/autobuild/netbsd-2-0/i386/OBJ/autobuild/netbsd-2-0/src/sys/arch/i386/compile/GENERIC.MP i386
Architecture: i386
Machine: i386
NetBSD 2.0_BETA (GENERIC.MP) #0: Sat May  8 00:33:21 UTC 2004
        autobuild@tgm.netbsd.org:/autobuild/netbsd-2-0/i386/OBJ/autobuild/netbsd-2-0/src/sys/arch/i386/compile/GENERIC.MP
total memory = 97916 KB
avail memory = 87900 KB
BIOS32 rev. 0 found at 0xe0000
mainbus0 (root)
mainbus0: Intel MP Specification (Version 1.4) (COMPAQ   Workstation )
cpu0 at mainbus0: apid 1 (boot processor)
cpu0: Intel Pentium Pro (686-class), 199.45 MHz, id 0x617
cpu0: features fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features fbff<PGE,MCA,CMOV>
cpu0: I-cache 8 KB 32b/line 4-way, D-cache 8 KB 32b/line 2-way
cpu0: L2 cache 256 KB 32b/line 4-way
cpu0: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu0: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way
cpu0: calibrating local timer
cpu0: apic clock running at 66 MHz
cpu0: 16 page colors
cpu1 at mainbus0: apid 0 (application processor)
cpu1: starting
cpu1: Intel Pentium Pro (686-class), 199.43 MHz, id 0x619
cpu1: features fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features fbff<PGE,MCA,CMOV>
cpu1: I-cache 8 KB 32b/line 4-way, D-cache 8 KB 32b/line 2-way
cpu1: L2 cache 256 KB 32b/line 4-way
cpu1: ITLB 32 4 KB entries 4-way, 2 4 MB entries fully associative
cpu1: DTLB 64 4 KB entries 4-way, 8 4 MB entries 4-way

>Description:
	While loading a page with lots of images in firefox, the browser
	stopped responding. top showed that it was using little CPU and was
	waiting in poll(). I then tried to kill it. At first, kill and kill -9
	from top had no effect. Then I tried 'kill -9 %' from the
	xterm where I started it, and the screen froze. I blindly typed
	ctrl-alt-esc, then reboot(0x104), and got a core dump.

antifer# ps -axl -M netbsd.0.core 
UID  PID       PPID CPU PRI NI   VSZ RSS WCHAN    STAT TT    TIME COMMAND
  0    0 1208619026   0 -18  0     0   0 schedule RWKs ?? 0:00.00 [swapper]
  0    1 1208619026   0  10  0    64   0 wait     RWs  ?? 0:00.00 init 
  0    2 1208619026   0  14  0     0   0 crypto_w RWK  ?? 0:00.00 [cryptoret]
  0    3 1208619026   0  -6  0     0   0 sccomp   RWK  ?? 0:00.00 [scsibus0]
  0    4 1208619026   0  -6  0     0   0 atath    RWK  ?? 0:00.00 [atabus0]
  0    5 1208619026   0  -6  0     0   0 atath    RWK  ?? 0:00.00 [atabus1]
  0    6 1208619026   0  -6  0     0   0 sccomp   RWK  ?? 0:00.00 [atapibus0]
  0    7 1208619026   0 -18  0     0   0 lfswrite RWK  ?? 0:00.00 [lfs_writer]
  0    8 1208619026   0 -18  0     0   0 pgdaemon RWK  ?? 0:00.00 [pagedaemon]
  0    9 1208619026   0  18  0     0   0 syncer   RWK  ?? 0:00.00 [ioflush]
  0   10 1208619026   0 -18  0     0   0 aiodoned RWK  ?? 0:00.00 [aiodoned]
  0   97 1208619026   0   2  0 32648   0 select   RWs  ?? 0:00.00 /usr/X11R6/bi
  0   98 1208619026   1  10  0   400   0 wait     RWs  ?? 0:01.00 xdm: :0 
  0   99 1208619026  15   2  0   340   0 select   RWs  ?? 0:15.00 /usr/sbin/ssh
 17  106 1208619026   0  18  0   964   0 pause    RWs  ?? 0:00.00 sendmail: Que
  0  193 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
331  196 1208619026   0   2  0   496   0 select   RWs  ?? 0:00.00 (fetchmail)
  0  208 1208619026   0   2  0   292   0 -        RWs  ?? 0:00.00 /usr/sbin/sys
  0  237 1208619026   0   2  0   148   0 select   RWs  ?? 0:00.00 /usr/sbin/ypb
  0  238 1208619026   0   2  0   324   0 poll     RWs  ?? 0:00.00 /usr/sbin/rpc
  0  254 1208619026  21   2  0    44   0 nfsd     RWL  ?? 0:21.00 nfsd: server 
  0  274 1208619026   0  10  0     0   0 nfsidl   RWK  ?? 0:00.00 [nfsio]
  0  275 1208619026   0   2  0   440   0 select   RWs  ?? 0:00.00 (amd)
  0  278 1208619026   0  10  0     0   0 nfsidl   RWK  ?? 0:00.00 [nfsio]
  0  279 1208619026   0  10  0     0   0 nfsidl   RWK  ?? 0:00.00 [nfsio]
  0  284 1208619026   0  10  0     0   0 nfsidl   RWK  ?? 0:00.00 [nfsio]
  0  286 1208619026   0  10  0   220   0 nanoslee RWs  ?? 0:00.00 /usr/sbin/cro
  0  312 1208619026  21   2  0    44   0 nfsd     RWL  ?? 0:21.00 (nfsd)
  0  313 1208619026  21   2  0   112   0 poll     RWs  ?? 0:21.00 nfsd: master 
  0  317 1208619026  21   2  0    44   0 nfsd     RWL  ?? 0:21.00 nfsd: server 
  0  332 1208619026  21   2  0    44   0 nfsd     RWL  ?? 0:21.00 nfsd: server 
  0  345 1208619026  25   2  0    60   0 kqread   RWs  ?? 0:25.00 /usr/sbin/ine
  0  373 1208619026  21   2  0   116   0 poll     RWs  ?? 0:21.00 (lpd)
  0  495 1208619026   0  18  0  1140   0 pause    RWs  ?? 0:00.00 (ntpd)
  0  539 1208619026  15   2  0   172   0 select   RWs  ?? 0:15.00 (xdm)
331 1328 1208619026   0   2  0   420   0 poll     RW   ?? 0:00.00 (xmeter)
331 1463 1208619026  16 -22  0     0   0 -        ZW   ?? 0:00.00 (xli)
  0 1777 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
  0 1855 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
  0 2047 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
331 2746 1208619026   1   2  0   196   0 poll     RW   ?? 0:01.00 (xmailbox)
  0 2981 1208619026   0   2  0   696   0 -        RWs  ?? 0:00.00 (xterm)
  0 3046 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
331 3048 1208619026   0   2  0   304   0 select   RW   ?? 0:00.00 (fvwm)
331 3051 1208619026   0   2  0   128   0 poll     RW   ?? 0:00.00 (oclock)
331 3165 1208619026   2  18  0   212   0 pause    RW   ?? 0:02.00 (csh)
  0 3190 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
331 3250 1208619026  36 -22  0     0   0 -        ZW   ?? 0:00.00 (csh)
  0 3352 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
  0 3452 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
  0 3579 1208619026   0   2  0   700   0 select   RW   ?? 0:00.00 (xterm)
331 3636 1208619026   0   2  0   208   0 poll     RW   ?? 0:00.00 (xload)
331 3300 1208619026   0   2  0   432   0 select   RWs+ p0 0:00.00 (ssh)
331 3113 1208619026   9   3  0   884   0 ttyin    RWs+ p1 0:09.00 (tcsh)
331 1623 1208619026   0   2  0   256   0 poll     RW+  p2 0:00.00 (top)
331 3349 1208619026   0  18  0   984   0 pause    RWs  p2 0:00.00 (tcsh)
331 1357 1208619026  19   3  0   884   0 ttyin    RWs+ p3 0:19.00 (tcsh)
331 3315 1208619026  13   3  0   884   0 ttyin    RWs+ p4 0:13.00 (tcsh)
331 1313 1208619026  36  64  0 12924   0 -        RWLa p5 0:36.00 (firefox-bin)
331 1679 1208619026   1  29  0   980   0 -        RWs+ p5 0:01.00 (tcsh)
331 3537 1208619026   7   3  0   884   0 ttyin    RWs+ p6 0:07.00 (tcsh)
331 3019 1208619026   0   2  0   676   0 select   RWs+ p7 0:00.00 (ssh)
331  194 1208619026   0   2  0   568   0 select   RWs+ p8 0:00.00 (ssh)
331  330 1208619026   0   2  0   492   0 select   RW+  p9 0:00.00 (ssh)
331  812 1208619026   0  18  0   888   0 pause    RWs  p9 0:00.00 (tcsh)
  0  660 1208619026  17   3  0    48   0 ttyin    RWs+ E0 0:17.00 (getty)
  0  160 1208619026  17   3  0    48   0 ttyin    RWs+ E1 0:17.00 /usr/libexec/
  0  671 1208619026  17   3  0    48   0 ttyin    RWs+ E2 0:17.00 /usr/libexec/
  0  666 1208619026  17   3  0    48   0 ttyin    RWs+ E3 0:17.00 (getty)

(gdb) target kcore netbsd.0.core
#0  0x00000001 in ?? ()
(gdb) where
#0  0x00000001 in ?? ()
#1  0xc0424f0f in cpu_reboot ()
#2  0xc034bd41 in db_reboot_cmd ()
#3  0xc034b887 in db_command ()
#4  0xc034b597 in db_command_loop ()
#5  0xc034e69f in db_trap ()
#6  0xc04224de in kdb_trap ()
#7  0xc0430dc2 in trap ()
#8  0xc010c6bf in calltrap ()
#9  0xc051bc38 in internal_command ()
#10 0xc051bd14 in wskbd_translate ()
#11 0xc051b9f5 in wskbd_cngetc ()
#12 0xc0432155 in cngetc ()
#13 0xc034d07d in db_readline ()
#14 0xc034d126 in db_read_line ()
#15 0xc034b586 in db_command_loop ()
#16 0xc034e69f in db_trap ()
#17 0xc04224de in kdb_trap ()
#18 0xc0430dc2 in trap ()
#19 0xc010c6bf in calltrap ()
#20 0xc04307af in syscall_plain ()

antifer# ps -ax -O paddr -M netbsd.0.core |grep firefox
1313 c54c0334 p5 RWLa 0:36.00 (firefox-bin)
(gdb) proc 0xc54c0334
can not access 0x24, invalid translation (invalid PTE)
can not access 0x24, invalid translation (invalid PTE)
cannot read pcb at 0x24

        
>How-To-Repeat:
	Looks random. I have been using this box for some time, and this is
	the first crash. However, I upgraded the kernel and userland (base
	only; I did not rebuild the packages) to the 20040505 snapshot
	yesterday. Before that it was running:
May  7 18:05:49 antifer /netbsd: NetBSD 2.0_BETA (GENERIC.MP) #0: Tue Mar 30 17:43:13 CEST 2004
May  7 18:05:49 antifer /netbsd:        bouyer@pop:/local/pop1/bouyer/tmp/i386/obj/local/pop1/bouyer/current/src/sys/arch/i386/compile/GENERIC.MP

>Fix:
	unknown.
>Release-Note:
>Audit-Trail:
>Unformatted: