kern/39993: lockup on i386 SMP (raidframe related ?)

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/39993: lockup on i386 SMP (raidframe related ?)
From: bouyer%antioche.eu.org@localhost
Date: Fri, 21 Nov 2008 11:35:00 +0000 (UTC)

>Number:         39993
>Category:       kern
>Synopsis:       lockup on i386 SMP (raidframe related ?)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Nov 21 11:35:00 +0000 2008
>Originator:     Manuel Bouyer
>Release:        NetBSD 5.0_BETA
>Organization:
>Environment:
System: NetBSD antioche.lip6.fr 5.0_BETA NetBSD 5.0_BETA (ANTIOCHE5) #2: Thu Nov 20 23:55:28 CET 2008 bouyer@roll:/dsk/l1/misc/bouyer/tmp/i386/obj/dsk/l1/misc/bouyer/netbsd-5/src/sys/arch/i386/compile/ANTIOCHE5 i386
Architecture: i386
Machine: i386
>Description:
        This system is a dual-CPU PIII system, with several SCSI disks on
        multiple esiop controllers. Some of them are part of RAID-1
        raidframe volumes (2 disks per volume). An SMP kernel will lock up
        within minutes after boot, under I/O load. The system is unresponsive
        to network or console (no ping, and no characters echoed on the serial
        console) but I could enter ddb using the cnmagic sequence. Here's
        what I found from ddb:
fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c03aabec cs 8 eflags 202 cr2 bb504000 ilevel 8
Stopped in pid 0.4 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}> tr
breakpoint(0,3f8,0,6,ca7953c0,cbd01900,cbb5beb0,c1540010,c1541000,7fa) at netbsd:breakpoint+0x4
comintr(cbd017f4,cbb5bec0,6,10,c04f0030,cbb50010,c04f0010,c04f4800,1,cbb5bf4c) at netbsd:comintr+0x566
Xintr_ioapic_edge4() at netbsd:Xintr_ioapic_edge4+0xa9
--- interrupt ---
fatal page fault in supervisor mode
trap type 6 code 0 eip c03ad04f cs 8 eflags 10206 cr2 3e ilevel 8
kernel: supervisor trap page fault, code=0
Faulted in DDB; continuing...
db{0}> mach cpu 1
using CPU 1
db{0}> tr
__cpu_simple_lock(c1702c40,c15a99f8,0,0,c15560d8,c15560d0,0,c1556000,c01c52c0,cc5a2d20) at netbsd:__cpu_simple_lock+0x1c
rf_RaidIOThread(c1556000,0,c01002a7,0,c01002a7,0,0,0,0,0) at netbsd:rf_RaidIOThread+0x7f

rf_RaidIOThread+0x7f is:
0xc01c533f is in rf_RaidIOThread (/dsk/l1/misc/bouyer/netbsd-5/src/sys/dev/raidframe/rf_engine.c:863).
858                     /* See what I/Os, if any, have arrived */
859                     while ((req = TAILQ_FIRST(&(raidPtr->iodone))) != NULL) {
860                             TAILQ_REMOVE(&(raidPtr->iodone), req, iodone_entries);
861                             simple_unlock(&(raidPtr->iodone_lock));
862                             rf_DiskIOComplete(req->queue, req, req->error);
863                             (req->CompleteFunc) (req->argument, req->error);
864                             simple_lock(&(raidPtr->iodone_lock));
865                     }

0xc01c5324 <rf_RaidIOThread+100>:       call   0xc010cd90 <__cpu_simple_unlock> 
0xc01c5329 <rf_RaidIOThread+105>:       mov    0x70(%ebx),%eax
0xc01c532c <rf_RaidIOThread+108>:       mov    0x58(%ebx),%edx
0xc01c532f <rf_RaidIOThread+111>:       mov    %ebx,0x4(%esp)
0xc01c5333 <rf_RaidIOThread+115>:       mov    %eax,0x8(%esp)
0xc01c5337 <rf_RaidIOThread+119>:       mov    %edx,(%esp)
0xc01c533a <rf_RaidIOThread+122>:       call   0xc01c1780 <rf_DiskIOComplete>
0xc01c533f <rf_RaidIOThread+127>:       mov    0x2c(%ebx),%edx
0xc01c5342 <rf_RaidIOThread+130>:       mov    0x70(%ebx),%eax
0xc01c5345 <rf_RaidIOThread+133>:       mov    0x28(%ebx),%ecx
0xc01c5348 <rf_RaidIOThread+136>:       mov    %edx,(%esp)
0xc01c534b <rf_RaidIOThread+139>:       mov    %eax,0x4(%esp)
0xc01c534f <rf_RaidIOThread+143>:       call   *%ecx
0xc01c5351 <rf_RaidIOThread+145>:       mov    %esi,(%esp)
0xc01c5354 <rf_RaidIOThread+148>:       call   0xc010cd70 <__cpu_simple_lock>


I also have a core dump. The stack traces were identical in all hangs I got.
Disabling SMP at boot (boot -1) works around the problem.
LOCKDEBUG+DEBUG+DIAGNOSTIC doesn't give any additional info.
This hardware was running without issues under 3.1 with SMP.
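
As background on why a CPU sitting in __cpu_simple_lock looks like a hard
hang: a spin lock acquisition busy-waits until the lock word is released,
so it never returns if no release happens. The following is a minimal
sketch of a test-and-set spin lock using C11 atomics. It is not the NetBSD
implementation, and all names (spin_t, spin_lock, spin_unlock,
spin_lock_bounded, spin_demo) are hypothetical. The bounded variant exists
only so that the demonstration can terminate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock type: one atomic flag, set while held. */
typedef struct {
    atomic_flag held;
} spin_t;

/* Spin until the flag was previously clear.
 * Never returns if the current holder never releases the lock. */
static void spin_lock(spin_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(spin_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Demonstration-only variant: give up after max_spins attempts.
 * Returns true if the lock was acquired. */
static bool spin_lock_bounded(spin_t *l, unsigned long max_spins)
{
    unsigned long i;

    for (i = 0; i < max_spins; i++) {
        if (!atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            return true;
    }
    return false;
}

/* Returns 0 if a held lock cannot be taken again and a released one can. */
static int spin_demo(void)
{
    spin_t l = { ATOMIC_FLAG_INIT };
    bool second, third;

    spin_lock(&l);                          /* lock is free: succeeds at once */
    second = spin_lock_bounded(&l, 1000);   /* already held: gives up */
    spin_unlock(&l);
    third = spin_lock_bounded(&l, 1000);    /* free again: succeeds */
    spin_unlock(&l);
    return (!second && third) ? 0 : 1;
}
```

With the unbounded spin_lock in place of the bounded call, the second
acquisition in spin_demo would loop forever.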

>How-To-Repeat:
        Boot an SMP system with raidframe and generate I/O?
>Fix:
        unknown
