Re: kern/39993: lockup on i386 SMP (raidframe related ?)

To: kern-bug-people%NetBSD.org@localhost, gnats-admin%NetBSD.org@localhost, netbsd-bugs%NetBSD.org@localhost
Subject: Re: kern/39993: lockup on i386 SMP (raidframe related ?)
From: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Date: Sat, 22 Nov 2008 15:03:41 +0100

On Fri, Nov 21, 2008 at 11:35:00AM +0000, bouyer%antioche.eu.org@localhost wrote:
> >Number:         39993
> >Category:       kern
> >Synopsis:       lockup on i386 SMP (raidframe related ?)
> >Confidential:   no
> >Severity:       critical
> >Priority:       high
> >Responsible:    kern-bug-people
> >State:          open
> >Class:          sw-bug
> >Submitter-Id:   net
> >Arrival-Date:   Fri Nov 21 11:35:00 +0000 2008
> >Originator:     Manuel Bouyer
> >Release:        NetBSD 5.0_BETA
> >Organization:
> >Environment:
> System: NetBSD antioche.lip6.fr 5.0_BETA NetBSD 5.0_BETA (ANTIOCHE5) #2: Thu Nov 20 23:55:28 CET 2008 bouyer@roll:/dsk/l1/misc/bouyer/tmp/i386/obj/dsk/l1/misc/bouyer/netbsd-5/src/sys/arch/i386/compile/ANTIOCHE5 i386
> Architecture: i386
> Machine: i386
> >Description:
>       This system is a dual-CPU PIII system, with several SCSI disks on
>       multiple esiop controllers. Some of them are part of RAID-1
>       RAIDframe volumes (2 disks per volume). An SMP kernel will lock up
>       within minutes after boot, under I/O load. The system is unresponsive
>       to network or console (no ping, and no characters echoed on the serial
>       console) but I could enter ddb using the cnmagic sequence. Here's
>       what I found from ddb:
> fatal breakpoint trap in supervisor mode
> trap type 1 code 0 eip c03aabec cs 8 eflags 202 cr2 bb504000 ilevel 8
> Stopped in pid 0.4 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
> db{0}> tr
> breakpoint(0,3f8,0,6,ca7953c0,cbd01900,cbb5beb0,c1540010,c1541000,7fa) at netbsd:breakpoint+0x4
> comintr(cbd017f4,cbb5bec0,6,10,c04f0030,cbb50010,c04f0010,c04f4800,1,cbb5bf4c) at netbsd:comintr+0x566
> Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xa9
> --- interrupt ---
> fatal page fault in supervisor mode
> trap type 6 code 0 eip c03ad04f cs 8 eflags 10206 cr2 3e ilevel 8
> kernel: supervisor trap page fault, code=0
> Faulted in DDB; continuing...
> db{0}> mach cpu 1
> using CPU 1
> db{0}> tr
> __cpu_simple_lock(c1702c40,c15a99f8,0,0,c15560d8,c15560d0,0,c1556000,c01c52c0,cc5a2d20) at netbsd:__cpu_simple_lock+0x1c
> rf_RaidIOThread(c1556000,0,c01002a7,0,c01002a7,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x7f
> 
> rf_RaidIOThread+0x7f is:
> 0xc01c533f is in rf_RaidIOThread (/dsk/l1/misc/bouyer/netbsd-5/src/sys/dev/raidframe/rf_engine.c:863).
> 858                     /* See what I/Os, if any, have arrived */
> 859                     while ((req = TAILQ_FIRST(&(raidPtr->iodone))) != NULL) {
> 860                             TAILQ_REMOVE(&(raidPtr->iodone), req, iodone_entries);
> 861                             simple_unlock(&(raidPtr->iodone_lock));
> 862                             rf_DiskIOComplete(req->queue, req, req->error);
> 863                             (req->CompleteFunc) (req->argument, req->error);
> 864                             simple_lock(&(raidPtr->iodone_lock));
> 865                     }
> 
> 0xc01c5324 <rf_RaidIOThread+100>:       call   0xc010cd90 <__cpu_simple_unlock>
> 0xc01c5329 <rf_RaidIOThread+105>:       mov    0x70(%ebx),%eax
> 0xc01c532c <rf_RaidIOThread+108>:       mov    0x58(%ebx),%edx
> 0xc01c532f <rf_RaidIOThread+111>:       mov    %ebx,0x4(%esp)
> 0xc01c5333 <rf_RaidIOThread+115>:       mov    %eax,0x8(%esp)
> 0xc01c5337 <rf_RaidIOThread+119>:       mov    %edx,(%esp)
> 0xc01c533a <rf_RaidIOThread+122>:       call   0xc01c1780 <rf_DiskIOComplete>
> 0xc01c533f <rf_RaidIOThread+127>:       mov    0x2c(%ebx),%edx
> 0xc01c5342 <rf_RaidIOThread+130>:       mov    0x70(%ebx),%eax
> 0xc01c5345 <rf_RaidIOThread+133>:       mov    0x28(%ebx),%ecx
> 0xc01c5348 <rf_RaidIOThread+136>:       mov    %edx,(%esp)
> 0xc01c534b <rf_RaidIOThread+139>:       mov    %eax,0x4(%esp)
> 0xc01c534f <rf_RaidIOThread+143>:       call   *%ecx
> 0xc01c5351 <rf_RaidIOThread+145>:       mov    %esi,(%esp)
> 0xc01c5354 <rf_RaidIOThread+148>:       call   0xc010cd70 <__cpu_simple_lock>
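
For anyone reading along without the RAIDframe source at hand: the loop quoted
above takes each request off the iodone queue, releases iodone_lock before
calling the completion handlers, and retakes it before looking at the queue
again. Below is a minimal userland sketch of that pattern only. It assumes
pthread mutexes in place of simple_lock, and struct io_req, struct io_done,
io_done_init(), io_done_post(), io_done_drain() and count_completion() are
hypothetical names made up for the illustration, not RAIDframe code.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a completed disk request (not the real struct). */
struct io_req {
	TAILQ_ENTRY(io_req) entries;
	void (*complete)(void *arg, int error);	/* completion handler */
	void *arg;
	int error;
};

TAILQ_HEAD(io_queue, io_req);

/* Hypothetical "done" queue protected by one lock. */
struct io_done {
	pthread_mutex_t lock;
	struct io_queue head;
};

static void
io_done_init(struct io_done *d)
{
	pthread_mutex_init(&d->lock, NULL);
	TAILQ_INIT(&d->head);
}

/* Producer side: queue a finished request under the lock. */
static void
io_done_post(struct io_done *d, struct io_req *req)
{
	pthread_mutex_lock(&d->lock);
	TAILQ_INSERT_TAIL(&d->head, req, entries);
	pthread_mutex_unlock(&d->lock);
}

/*
 * Consumer side: same shape as the quoted loop. The lock is held only
 * while the queue is touched and is dropped around the callback, so a
 * handler that posts new requests does not deadlock on d->lock.
 * Returns the number of requests completed.
 */
static int
io_done_drain(struct io_done *d)
{
	struct io_req *req;
	int n = 0;

	pthread_mutex_lock(&d->lock);
	while ((req = TAILQ_FIRST(&d->head)) != NULL) {
		TAILQ_REMOVE(&d->head, req, entries);
		pthread_mutex_unlock(&d->lock);
		req->complete(req->arg, req->error);
		n++;
		pthread_mutex_lock(&d->lock);
	}
	pthread_mutex_unlock(&d->lock);
	return n;
}

/* Example handler for the sketch: bump the int counter passed as arg. */
static void
count_completion(void *arg, int error)
{
	(void)error;
	(*(int *)arg)++;
}
```

Posting two requests that use count_completion() and then calling
io_done_drain() once runs both handlers with the lock released. Note that the
lock is always retaken at the bottom of the loop before the queue is examined
again.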

Here's what gdb says about it:
#0  0xc03b11e7 in cpu_reboot ()
#1  0xc0305348 in panic ()
#2  0xc02fdcdb in lockdebug_abort1 ()
#3  0xc02d3a24 in mutex_vector_enter ()
#4  0xc02e87b6 in suspendsched ()
#5  0xc03467a3 in vfs_shutdown ()
#6  0xc03b1227 in cpu_reboot ()
#7  0xc01a3fa8 in db_reboot_cmd ()
#8  0xc01a3ab8 in db_command ()
#9  0xc01a3e02 in db_command_loop ()
#10 0xc01a6d10 in db_trap ()
#11 0xc03ac4bb in kdb_trap ()
#12 0xc03b3283 in trap ()
#13 0xc010cc36 in calltrap ()
#14 0xc03aabec in breakpoint ()
#15 0xc01ec4a6 in comintr ()
#16 0xc0103949 in Xintr_ioapic_edge4 ()
#17 0xc02ce55f in _kernel_lock ()
#18 0xc039bd56 in intr_biglock_wrapper ()
#19 0xc010718d in Xintr_ioapic_level2 ()
#20 0xc03aac85 in x86_stihlt ()
Previous frame inner to this frame (corrupt stack?)

Any ideas on how to get more information?

-- 
Manuel Bouyer <bouyer%antioche.eu.org@localhost>
     NetBSD: 26 years of experience will always make the difference