current-users: Re: Possible serious bug in NetBSD-1.6.1

Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Paul Ripke <stixpjr@ozemail.com.au>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: current-users
Date: 03/11/2003 02:18:58
	Hello Paul.  I believe it is the same bug you encountered.  I've a few
observations about the bug, which I hope will be helpful in resolving it.

1.  It's definitely related to the use of the raid5 level device, including
raw partitions.  

2.  The bug exists in NetBSD 1.6, but not in 1.5R, early 2001 code.

3.  To reliably reproduce the hang:
A.  Define a swap partition on your raid5 device.

B.  Add that partition to the system's swap space with swapctl.

C.  Watch the system go into the deep freezer when you try to link a kernel
with debugging symbols turned on and swap is needed.

4.  Softdep exacerbates the problem, but it's not softdep which is to
blame here.
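
For anyone who wants to try this, the steps in item 3 might look roughly
like the sketch below.  It only prints the commands rather than running
them.  The partition name raid0b is an assumption for illustration (use
whatever swap partition you defined on the RAID 5 set), and it assumes the
kernel config has debugging symbols enabled, e.g. via makeoptions DEBUG="-g".
The compile directory is the one shown in the dmesg quoted below.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps in item 3: prints the commands.
# Step A (defining the swap partition) is an interactive disklabel edit
# and is not shown here.

# Print the commands for a given swap device and kernel compile directory.
repro_commands() {
    swapdev=$1     # assumed name, e.g. /dev/raid0b
    builddir=$2    # kernel compile directory
    echo "swapctl -a $swapdev"     # step B: add the RAID 5 partition as swap
    echo "swapctl -l"              # confirm the device is now listed
    echo "cd $builddir && make"    # step C: link the debug kernel
}

repro_commands /dev/raid0b /usr/local/netbsd/src/sys/arch/i386/compile/NFBNETBSD
```

On a machine with little RAM (this one has 126 MB), the final link should
push the system into swap, which is when the hang shows up here.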


	Has Greg indicated whether or not he has any ideas on the matter?
-thanks
-Brian
On Mar 11,  4:18pm, Paul Ripke wrote:
} Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
} On Tuesday, Mar 11, 2003, at 11:16 Australia/Sydney, Brian Buhrow wrote:
} 
} > 	Hello folks.  I've got a machine running NetBSD-1.6.1_RC2 with sources
} > as of February 28, 2003.  This machine, unlike others I have running the
} > same code, consistently either hangs or panics every 24-48 hours.  The
} > primary difference between this machine and the rest of the ones I have
} > running the same code is that it is using a raidframe raid5 device for all
} > of its disk storage.  When it hangs, it remains pingable on the net, but
} > cannot be interrogated via the serial console and must be reset.
} 
} The hang might be PR kern/20191 which I reported recently. I've only tested
} current (since around mid January), but 1.6.1_RC2 may have the same
} problem. Are you running softdep on the RAID5? I found that softdep made
} the hang far more repeatable. I've provided a core to Greg Oster from a
} hang in single user, which I presume he's currently chewing on...
} 
} > 	I was able to capture the latest panic dump, and it looks like it is
} > taking an illegal page fault while trying to run the syncer kernel thread.
} > Specifically, it faulted in genfs_putpages() as a result of an
} > ffs_full_sync().  In the excerpt from the dmesg of the crash below, it
} > double panics because the lockmgr can't get a lock to sync the disks.
} > 	Does anyone have any ideas?  This is an I386 machine, and it is almost
} > unusable as a server in its current state.  I have a full panic core file,
} > if that would help.  I'm also willing to try things if folks have
} > suggestions.
} > -thanks
} > -Brian
} >
} > NetBSD 1.6.1_RC2 (NFBNETBSD) #0: Fri Mar  7 08:23:54 PST 2003
} >      buhrow@lothlorien.nfbcal.org:/usr/local/netbsd/src/sys/arch/i386/compile/NFBNETBSD
} > cpu0: Intel Pentium III (Coppermine) (686-class), 756.83 MHz
} > cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 2-way
} > cpu0: L2 cache 256 KB 32b/line 8-way
} > cpu0: features 383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR>
} > cpu0: features 383f9ff<PGE,MCA,CMOV,FGPAT,PSE36,MMX>
} > cpu0: features 383f9ff<FXSR,SSE>
} > total memory = 126 MB
} > avail memory = 112 MB
} > [...]
} > Kernelized RAIDframe activated
} > RAID autoconfigure
} > Configuring raid0:
} > RAIDFRAME: protectedSectors is 64
} > RAIDFRAME: Configure (RAID Level 5): total number of sectors is 304213760 (148541 MB)
} > RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
} > boot device: raid0
} > root on raid0a dumps on wd0b
} > root file system type: ffs
} > raid0: Device already configured!
} > uvm_fault(0xc05d7320, 0xffc00000, 0, 1) -> e
} > fatal page fault in supervisor mode
} > trap type 6 code 0 eip c0311347 cs 8 eflags 10202 cr2 ffc000c4 cpl 0
} > panic: trap
} > syncing disks... panic: lockmgr: locking against myself
} >
} > dumping to dev 0,1 offset 1837871
} > dump 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111  
} > 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91  
} > 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68  
} > 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45  
} > 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22  
} > 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
} 
} Cheers,
} --
} Paul Ripke
} Unix/OpenVMS/TSM/DBA
} 101 reasons why you can't find your Sysadmin:
} 68: It's 9AM. He/She is not working that late.
} -- Koos van den Hout
} 
>-- End of excerpt from Paul Ripke