Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
To: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
From: Greg Oster <oster@cs.usask.ca>
List: current-users
Date: 03/12/2003 11:43:54
Brian Buhrow writes:
> 	Hello Greg.  I tried a similar experiment here, and found the same
> result.  Swapping to a raw disk partition, as opposed to a raid partition,
> works fine, even if the rest of the filesystem i/o is going through the raid.
> 	This experience triggered a memory I had of using raidframe under 1.5R.
> On that system, I could not run fsck -n on a mounted filesystem if the
> filesystem sat on a raid5 device because doing so would cause uvm to go
> into a tight loop claiming it was performing deadlock avoidance.  When the
> machine died this evening while I was paging to the raid 5 device, I
> noticed that it died gradually, as if some resource had become unavailable
> and then, as more things began to need that resource, things coasted to a
> halt.
> 	In thinking about this some more, I'm pretty sure that the problem
> isn't a lack of kernel memory, as we previously thought.  Vmstat -m
> consistently showed only 3MB of kernel memory in use during the entire run
> up to death.  

Using kgdb, when swapping to RAID 5, after the "hang" I see:

(gdb) xps
              proc   pid     flag st              wchan comm
        0xcb25aad4   236     4006  3         0xc06bbc90 emacs (flt_noram5)
        0xcb25a908   235     4086  3         0xc06a7748 emacs (select)
        0xcb34a3a8   233     4006  3         0xc06bbc90 top (flt_noram1)
        0xcb34a1dc   225     4082  3         0xcb34b624 csh (pause)
        0xcb34a010   224     4084  3         0xc06a7748 telnetd (select)
        0xcb25a3a4   213     4006  3         0xc06bbc90 csh (uao_getpage)
        0xcb16d900   212     4084  3         0xc06a7748 telnetd (select)
        0xcb16d734   206     4082  3         0xca92c008 getty (ttyin)
        0xcb25a73c   204       80  3         0xc0681a60 cron (nanosleep)
        0xcb25a570   201       80  3         0xc06a7748 inetd (select)
        0xcb25a1d8   153       84  3         0xc0a12600 nfsd (nfsd)
        0xcb25a00c   152       84  3         0xc0a12800 nfsd (nfsd)
        0xcb196c9c   151       84  3         0xc0a12e00 nfsd (nfsd)
        0xcb16dc98   150       84  3         0xc0a12000 nfsd (nfsd)
        0xcb196ad0   148       80  3         0xc06a7748 nfsd (select)
        0xcb1961d4   140       80  3         0xc06a7748 mountd (select)
        0xcb196904   124    20284  3         0xc069cb4c nfsio (nfsidl)
        0xcb196738   123    20284  3         0xc069cb48 nfsio (nfsidl)
        0xcb19656c   122    20284  3         0xc069cb44 nfsio (nfsidl)
        0xcb1963a0   121    20284  3         0xc069cb40 nfsio (nfsidl)
        0xcb196008   112       80  3         0xc06a7748 rpcbind (select)
        0xcb16dacc   101        4  3         0xc077a13c syslogd (anonget2)
        0xcb16d568    12    20204  3         0xc06bbe38 aiodoned (aiodoned)
        0xcb16d39c    11    20204  3         0xc06bbc90 ioflush (km_getwait2)
        0xcb16d1d0    10    20204  3         0xc06a6bb0 reaper (reaper)
        0xcb16d004     9    20204  3         0xc06bbe2c pagedaemon (pgdaemon)
        0xcb154c94     8    20204  3         0xc06bbc90 raid (km_getwait2)
        0xcb154ac8     7    20204  3         0xc0988160 raid (rfwcond)
        0xcb1548fc     6    20204  3         0xc0968eac scsibus2 (sccomp)
        0xcb154730     5    20204  3         0xc095d4ac scsibus1 (sccomp)
        0xcb154564     4    20204  3         0xc09618ac scsibus0 (sccomp)
        0xcb154398     3    20204  3         0xc06cc5e8 usbtask (usbtsk)
        0xcb1541cc     2    20204  3         0xc0960238 usb0 (usbevt)
        0xcb154000     1     4080  3         0xcb154000 init (wait)
        0xc06a4e60     0    20204  3         0xc06a4e60 swapper (scheduler)
              proc   pid     flag st              wchan comm
(gdb)

which indicates lots of stuff is waiting on getting more kernel memory..
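
(For anyone not staring at wchan strings every day: if I'm reading uvm_km.c
right, "km_getwait2" is the uvm_wait() you land in when a kernel-memory
allocation can't get a free page and has to sleep until the pagedaemon frees
some.  The little userland model below is not the kernel code -- every name
in it is made up -- but it shows the shape of the problem: if the thread that
would eventually free pages (say, the swap write working its way through the
raid driver) is itself stuck in that sleep, nothing is ever going to do the
wakeup.)

/*
 * Userland sketch (not the kernel code; every name here is invented) of the
 * pattern behind a "km_getwait2"-style wait: an allocator that sleeps until
 * someone frees a page.  If the I/O path that page-freeing depends on (e.g.
 * a swap write through a RAID 5 set) ends up in this same sleep, there is
 * nobody left to do the wakeup -- which is what the xps output above looks
 * like to me.
 *
 * Build with: cc -o kmmodel kmmodel.c -lpthread
 * (It runs and then hangs, on purpose -- see main().)
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t km_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  km_wait = PTHREAD_COND_INITIALIZER;	/* "km_getwait2" */
static int free_pages = 0;					/* free kernel pages */

/* Grab one page of kernel memory, sleeping until one is available. */
static void
km_getpage(void)
{
	pthread_mutex_lock(&km_lock);
	while (free_pages == 0)
		pthread_cond_wait(&km_wait, &km_lock);	/* tsleep on "km_getwait2" */
	free_pages--;
	pthread_mutex_unlock(&km_lock);
}

/* Called when a page-out completes and a page can be handed back. */
static void
km_putpage(void)
{
	pthread_mutex_lock(&km_lock);
	free_pages++;
	pthread_cond_signal(&km_wait);			/* wakeup() */
	pthread_mutex_unlock(&km_lock);
}

int
main(void)
{
	/*
	 * With free_pages == 0 this blocks forever: the only caller of
	 * km_putpage() would be the completion of a page-out, and if that
	 * page-out has to go through a driver that itself needs km_getpage()
	 * to make progress, things coast to a halt just the way Brian
	 * described.
	 */
	km_getpage();
	printf("got a page\n");
	km_putpage();
	return (0);
}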

If you want to see "who was using what", grab:

 http://www.cs.usask.ca/staff/oster/swapdebug.notes

and then cross-reference the lines with /usr/include/sys/malloc.h to see
who was using what...  (I had these last night, but still haven't had time
to go through them carefully...  A quick look showed the RAID stuff not
using more than 400K.  I also haven't had time to look at what the flt_noram*
waiting is all about... I need to see how things are behaving in -current as
well, given that many of the allocations are now poolified...)

> I'm thinking that perhaps there is a problem with the way
> physio() interacts with the locks you use to lock the individual
> components of the raid5 set.  There's definitely a lot of calling around
> inside the kernel while locks are set, and I wonder if the i/o pattern
> raid5 uses tends to exercise the locking algorithm more heavily than do
> the raid0 and raid1 i/o patterns.  Or, alternatively, if the raid5 pattern
> causes uvm to lock itself up.

No...  RAID 5 sets themselves work just fine.  It's swapping to them that is
broken.  (I've put RAID 5 sets through intense file-system tests, and they do 
just fine...)

> 	Note that I'm merely speculating based on relatively little knowledge
> about how the internals of raidframe and uvm work.  I'm merely trying to
> make some, hopefully, reasonable suggestions about where to look.
> 	Good luck with the testing.  If there's anything I can help with to
> further the process of squashing this little annoyance, let me know.  I
> love raidframe, and I'd love to see it work completely as one would expect.

If'n'when I have an "easy" patch, I'll get you to test it :)  (I suspect there 
won't be an "easy" patch though :-/ )


> On Mar 11, 10:57pm, Greg Oster wrote:
> } Subject: Re: Possible serious bug in NetBSD-1.6.1_RC2
> } Brian Buhrow writes:
> } > 2.  I'm not certain, but my guess is that the reason raid5 works under
> } > 1.5R and not 1.6 is not so much due to the changes in raidframe itself,
> } > but, rather, changes in the way kernel memory is managed by uvm.  They
> } > may be only tuning changes, but something definitely changed.
> } 
> } So whatever the problem is, I can reliably lockup my test box in seconds
> } when swapping to a RAID 5 set... Using the same test I couldn't kill my 
> } test box when swapping to a RAID 1 set.  
> } 
> } Workaround to the problem: Swap to a RAID 1 set instead of a RAID 5 set.  
> } While reads might not be as fast, writes will be (likely) 4x faster, which 
> } should improve your swapping/paging performance too!
> } 
> } Figuring out what the problem is will require some additional testing,
> } and that will have to wait until at least tomorrow... :-}
> } 
> } Thanks to everyone who helped narrow down the problem...
> } 
> } Later...
> } 
> } Greg Oster
> } 
> >-- End of excerpt from Greg Oster
> 

Later...

Greg Oster