current-users: Re: Possible serious bug in NetBSD-1.6.1

Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Paul Ripke <stixpjr@ozemail.com.au>
List: current-users
Date: 03/11/2003 23:20:59
[ Greg, I've CC'ed you in on this, hope you don't mind. In case you've
missed this thread, here's some more info and, IMNSHO, an almost 100%
reproducible test case, if my understanding is anywhere close to the mark. ]

On Tuesday, Mar 11, 2003, at 21:18 Australia/Sydney, Brian Buhrow wrote:

> 	Hello Paul.  I believe it is the same bug you encountered.  I've a few
> observations about the bug, which I hope will be helpful in resolving it.
>
> 1.  It's definitely related to the use of the raid5 level device, including
> raw partitions.
>
> 2.  The bug exists in NetBSD 1.6, but not in 1.5R (early 2001 code).

Good to know - this will definitely help, I'm sure.

> 3.  To reliably reproduce the hang:
> A.  Define a swap partition on your raid5 device.
>
> B.  Enable that partition as swap with swapctl.
>
> C.  Watch the system go into the deep freezer when you try to link a kernel
> with debugging symbols turned on and swap is needed.

Hmm... swap on RAID5... never thought of doing that. Think I've only ever
seen it mirrored... nup, correction, have seen swap on hardware RAID5 on
Tru64 with internal PCI RAID controller. OK, back to NetBSD, yes, I can
see how swap on RAIDframe RAID5 would exacerbate this problem!
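
For anyone who wants to push a test box into swap without waiting on a
debug kernel link, here is a minimal sketch of a memory hog. It is only a
stand-in for step C above: the assumption (not confirmed anywhere in this
thread) is that any workload which forces paging to the RAID5 swap
partition will do. The function name hog() and its parameters are made up
for illustration.

```python
def hog(total_mb, chunk_mb=1, page=4096):
    """Allocate total_mb of memory in chunk_mb pieces and touch every page.

    Hypothetical helper: it only creates memory pressure. Whether that is
    enough to trigger the hang is an assumption, not something the thread
    has established.
    """
    chunks = []
    for _ in range(total_mb // chunk_mb):
        buf = bytearray(chunk_mb * 1024 * 1024)
        # Write one byte per page so the pages are really backed by memory.
        for off in range(0, len(buf), page):
            buf[off] = 1
        chunks.append(buf)
    # The caller must keep the returned list alive to keep the memory held.
    return chunks
```

Calling something like held = hog(256) on the 126 MB machine described
below should exceed physical memory; pick the size to suit the box under
test, and do it on the console of a machine you can afford to reset.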

> 4.  Softdep exacerbates the problem, but it's not softdep which is to
> blame here.
>
> 	Has Greg indicated whether or not he has any ideas on the matter?

I'm CC'ing Greg in, I have a hunch he understands the problem, but the
fix is more a design problem than bug squashing. Greg, correct me if I'm
wrong...

> -thanks
> -Brian
>
> On Mar 11,  4:18pm, Paul Ripke wrote:
> } Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
> } On Tuesday, Mar 11, 2003, at 11:16 Australia/Sydney, Brian Buhrow wrote:
> }
> } > 	Hello folks.  I've got a machine running NetBSD-1.6.1_RC2 with sources
> } > as of February 28, 2003.  This machine, unlike others I have running the
> } > same code, consistently either hangs or panics every 24-48 hours.  The
> } > primary difference between this machine and the rest of the ones I have
> } > running the same code is that it is using a raidframe raid5 device for all
> } > of its disk storage.  When it hangs, it remains pingable on the net, but
> } > cannot be interrogated via the serial console and must be reset.
> }
> } The hang might be PR kern/20191 which I reported recently. I've only tested
> } current (since around mid January), but 1.6.1_RC2 may have the same
> } problem. Are you running softdep on the RAID5? I found that softdep made
> } the hang far more repeatable. I've provided a core to Greg Oster from a
> } hang in single user, which I presume he's currently chewing on...
> }
> } > 	I was able to capture the latest panic dump, and it looks like it is
> } > taking an illegal page fault while trying to run the syncer kernel thread.
> } > Specifically, it faulted in genfs_putpages() as a result of an
> } > ffs_full_sync().  In the excerpt from the dmesg of the crash below, it
> } > double panics because the lockmgr can't get a lock to sync the disks.
> } > 	Does anyone have any ideas?  This is an I386 machine, and it is almost
> } > unusable as a server in its current state.  I have a full panic core file,
> } > if that would help.  I'm also willing to try things if folks have
> } > suggestions.
> } > -thanks
> } > -Brian
> } >
> } > NetBSD 1.6.1_RC2 (NFBNETBSD) #0: Fri Mar  7 08:23:54 PST 2003
> } > buhrow@lothlorien.nfbcal.org:/usr/local/netbsd/src/sys/arch/i386/compile/NFBNETBSD
> } > cpu0: Intel Pentium III (Coppermine) (686-class), 756.83 MHz
> } > cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 2-way
> } > cpu0: L2 cache 256 KB 32b/line 8-way
> } > cpu0: features 383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR>
> } > cpu0: features 383f9ff<PGE,MCA,CMOV,FGPAT,PSE36,MMX>
> } > cpu0: features 383f9ff<FXSR,SSE>
> } > total memory = 126 MB
> } > avail memory = 112 MB
> } > [...]
> } > Kernelized RAIDframe activated
> } > RAID autoconfigure
> } > Configuring raid0:
> } > RAIDFRAME: protectedSectors is 64
> } > RAIDFRAME: Configure (RAID Level 5): total number of sectors is 304213760 (148541 MB)
> } > RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
> } > boot device: raid0
> } > root on raid0a dumps on wd0b
> } > root file system type: ffs
> } > raid0: Device already configured!
> } > uvm_fault(0xc05d7320, 0xffc00000, 0, 1) -> e
> } > fatal page fault in supervisor mode
> } > trap type 6 code 0 eip c0311347 cs 8 eflags 10202 cr2 ffc000c4 cpl 0
> } > panic: trap
> } > syncing disks... panic: lockmgr: locking against myself
> } >
> } > dumping to dev 0,1 offset 1837871
> } > dump 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111
> } > 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91
> } > 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68
> } > 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45
> } > 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22
> } > 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
> }
> } Cheers,
> } --
> } Paul Ripke
> } Unix/OpenVMS/TSM/DBA
> } 101 reasons why you can't find your Sysadmin:
> } 68: It's 9AM. He/She is not working that late.
> } -- Koos van den Hout
> }
>> -- End of excerpt from Paul Ripke
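
As a quick cross-check of the quoted panic, the fault address in cr2 can be
rounded down to its page and compared with the address printed in the
uvm_fault() line. The sketch below does just that. It assumes the usual
4 KB i386 page size, assumes the trap line keeps the "name value" layout
shown in the excerpt, and assumes the second uvm_fault() argument is the
page-truncated fault address; parse_trap() is a made-up helper, not part of
any NetBSD tool.

```python
PAGE_SIZE = 4096  # assumed i386 page size


def parse_trap(line):
    """Split a 'trap type N code X eip ...' line into a dict of integers.

    Hypothetical helper: it assumes the layout seen in the quoted dmesg,
    with the trap type in decimal and the remaining values in hex.
    """
    words = line.split()
    if words[:2] != ["trap", "type"]:
        raise ValueError("not a trap line: %r" % line)
    fields = {"type": int(words[2])}
    rest = words[3:]
    # The remaining words alternate between a field name and its hex value.
    for name, value in zip(rest[0::2], rest[1::2]):
        fields[name] = int(value, 16)
    return fields


trap = parse_trap(
    "trap type 6 code 0 eip c0311347 cs 8 eflags 10202 cr2 ffc000c4 cpl 0"
)
fault_page = trap["cr2"] & ~(PAGE_SIZE - 1)  # page containing the fault
offset = trap["cr2"] & (PAGE_SIZE - 1)       # offset within that page
print(hex(fault_page), hex(offset))          # prints: 0xffc00000 0xc4
```

The page comes out as 0xffc00000, the same address that appears in the
uvm_fault() line, with the access 0xc4 bytes into it. The "-> e" there
looks like error 14 (EFAULT) printed in hex, though that reading is an
assumption rather than something stated in the thread.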

Cheers,
--
Paul Ripke
Unix/OpenVMS/TSM/DBA
101 reasons why you can't find your Sysadmin:
68: It's 9AM. He/She is not working that late.
-- Koos van den Hout