Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Paul Ripke <stixpjr@ozemail.com.au>
List: current-users
Date: 03/11/2003 16:18:26
On Tuesday, Mar 11, 2003, at 11:16 Australia/Sydney, Brian Buhrow wrote:

> 	Hello folks.  I've got a machine running NetBSD-1.6.1_RC2 with sources
> as of February 28, 2003.  This machine, unlike others I have running the
> same code, consistently either hangs or panics every 24-48 hours.  The
> primary difference between this machine and the rest of the ones I have
> running the same code is that it is using a raidframe raid5 device for all
> of its disk storage.  When it hangs, it remains pingable on the net, but
> cannot be interrogated via the serial console and must be reset.

The hang might be PR kern/20191 which I reported recently. I've only tested
current (since around mid January), but 1.6.1_RC2 may have the same
problem. Are you running softdep on the RAID5? I found that softdep made
the hang far more repeatable. I've provided a core to Greg Oster from a
hang in single user, which I presume he's currently chewing on...

> 	I was able to capture the latest panic dump, and it looks like it is
> taking an illegal page fault while trying to run the syncer kernel thread.
> Specifically, it faulted in genfs_putpages() as a result of an
> ffs_full_sync().  In the excerpt from the dmesg of the crash below, it
> double panics because the lockmgr can't get a lock to sync the disks.
> 	Does anyone have any ideas?  This is an i386 machine, and it is almost
> unusable as a server in its current state.  I have a full panic core file,
> if that would help.  I'm also willing to try things if folks have
> suggestions.
> -thanks
> -Brian
>
> NetBSD 1.6.1_RC2 (NFBNETBSD) #0: Fri Mar  7 08:23:54 PST 2003
>     buhrow@lothlorien.nfbcal.org:/usr/local/netbsd/src/sys/arch/i386/compile/NFBNETBSD
> cpu0: Intel Pentium III (Coppermine) (686-class), 756.83 MHz
> cpu0: I-cache 16 KB 32b/line 4-way, D-cache 16 KB 32b/line 2-way
> cpu0: L2 cache 256 KB 32b/line 8-way
> cpu0: features 383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR>
> cpu0: features 383f9ff<PGE,MCA,CMOV,FGPAT,PSE36,MMX>
> cpu0: features 383f9ff<FXSR,SSE>
> total memory = 126 MB
> avail memory = 112 MB
> [...]
> Kernelized RAIDframe activated
> RAID autoconfigure
> Configuring raid0:
> RAIDFRAME: protectedSectors is 64
> RAIDFRAME: Configure (RAID Level 5): total number of sectors is 304213760 (148541 MB)
> RAIDFRAME(RAID Level 5): Using 20 floating recon bufs with head sep limit 10
> boot device: raid0
> root on raid0a dumps on wd0b
> root file system type: ffs
> raid0: Device already configured!
> uvm_fault(0xc05d7320, 0xffc00000, 0, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 eip c0311347 cs 8 eflags 10202 cr2 ffc000c4 cpl 0
> panic: trap
> syncing disks... panic: lockmgr: locking against myself
>
> dumping to dev 0,1 offset 1837871
> dump 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111  
> 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91  
> 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68  
> 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45  
> 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22  
> 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

Cheers,
--
Paul Ripke
Unix/OpenVMS/TSM/DBA
101 reasons why you can't find your Sysadmin:
68: It's 9AM. He/She is not working that late.
-- Koos van den Hout