tech-kern archive

Re: tstile syndrome



On Thu, Aug 27, 2009 at 03:00:11PM +0200, Manuel Bouyer wrote:

> On Thu, Aug 27, 2009 at 01:09:16PM +0200, Manuel Bouyer wrote:
> > Hi,
> > here's what I found so far on a server that shows the tstile hang,
> > with some ddb+gdb playing.
> > 
> > Most processes are waiting on a turnstile (you did know that),
> > the one I started with had more than 4000 writers in the queue.
> > The threads did come here through a VOP_LOCK() (you did also know that).
> > This is a turnstile for a rwlock, I found the owner of this rwlock.
> > This thread is also waiting on a turnstile, but a different one,
> > it also did come here through a VOP_LOCK. This is also a turnstile for a
> > rwlock, which also has an owner, which also has VOP_LOCK in its stack
> > trace and is waiting on a turnstile. It's also a rwlock (I checked the
> > l_syncobj) but l_wchan is bogus: ffff800079ac402f, this is not a
> > valid krwlock_t* (and examining memory at this address doesn't look like
> > a valid krwlock_t value, and 'show lock' doesn't know about it either).
> 
> I think I mixed up pointers and values at one point.
> I got another instance of the tstile deadlock and I think I found the
> cause:
> 
> ffff800079987800 wchan_t 0xffff80008e572958 syncobj 0xffffffff806cf280 rw
> owner 0xffff8000d47a3baf
> 
> ffff8000d47a3ba0 wchan_t 0xffff80008f301290 syncobj 0xffffffff806cf280 rw
> owner 0xffff80007998780f
> 
> So ffff800079987800 is waiting on a lock held by 0xffff8000d47a3ba0, and
> ffff8000d47a3ba0 is waiting on a lock held by 0xffff800079987800.
> 
> Here are the stack traces for both processes:
> db{0}> tr/a ffff800079987800
> trace: pid 21115 lid 1 at 0xffff80007931c710
> sleepq_block() at netbsd:sleepq_block+0xec
> turnstile_block() at netbsd:turnstile_block+0x29e
> rw_vector_enter() at netbsd:rw_vector_enter+0x28c
> vlockmgr() at netbsd:vlockmgr+0xf6
> VOP_LOCK() at netbsd:VOP_LOCK+0x64
> vn_lock() at netbsd:vn_lock+0xd9
> wapbl_ufs_rename() at netbsd:wapbl_ufs_rename+0x5ab
> ufs_rename() at netbsd:ufs_rename+0x39
> VOP_RENAME() at netbsd:VOP_RENAME+0x75
> do_sys_rename() at netbsd:do_sys_rename+0x57d
> syscall() at netbsd:syscall+0xb6      
> db{0}> tr/a ffff8000d47a3ba0
> trace: pid 25624 lid 1 at 0xffff8000d47cb650
> sleepq_block() at netbsd:sleepq_block+0xec
> turnstile_block() at netbsd:turnstile_block+0x29e
> rw_vector_enter() at netbsd:rw_vector_enter+0x28c
> vlockmgr() at netbsd:vlockmgr+0xf6    
> VOP_LOCK() at netbsd:VOP_LOCK+0x64    
> vn_lock() at netbsd:vn_lock+0xd9      
> cache_lookup() at netbsd:cache_lookup+0x201
> ufs_lookup() at netbsd:ufs_lookup+0xcd
> VOP_LOOKUP() at netbsd:VOP_LOOKUP+0x80
> lookup() at netbsd:lookup+0x34b       
> namei() at netbsd:namei+0x1a4
> do_sys_stat() at netbsd:do_sys_stat+0x44
> sys___lstat30() at netbsd:sys___lstat30+0x2a
> syscall() at netbsd:syscall+0xb6      
> 
> Any idea how to fix this?

WAPBL's copy of ufs_rename() is quite out of date - I would begin by finding
out why and determining whether it can be folded into the stock ufs_rename().
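
For reference, the cycle in the quoted ddb output can be checked mechanically.
The sketch below is only an illustration, not kernel code: it assumes the low
four bits of an rwlock owner word are flag bits (which is what the owner
values ending in ...baf and ...80f above suggest), so masking them off gives
the address of the owning thread. The names owner_thread(), in_wait_cycle()
and struct waiter are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the low four bits of an rwlock owner word hold flags. */
#define OWNER_FLAG_MASK ((uint64_t)0x0f)

/* One blocked thread as read from ddb: its own address and the
 * owner word of the rwlock it is sleeping on. */
struct waiter {
	uint64_t thread;
	uint64_t lock_owner;
};

/* Strip the flag bits to recover the owning thread's address. */
static uint64_t
owner_thread(uint64_t owner_word)
{
	return owner_word & ~OWNER_FLAG_MASK;
}

/* Find the table entry for a given thread, or NULL if it is not blocked. */
static const struct waiter *
find_waiter(const struct waiter *w, size_t n, uint64_t thread)
{
	for (size_t i = 0; i < n; i++) {
		if (w[i].thread == thread)
			return &w[i];
	}
	return NULL;
}

/* Follow "waits on a lock owned by" links from start. Returns true if
 * the chain leads back to start within n hops, i.e. start is deadlocked. */
static bool
in_wait_cycle(const struct waiter *w, size_t n, uint64_t start)
{
	uint64_t cur = start;

	for (size_t hops = 0; hops < n; hops++) {
		const struct waiter *e = find_waiter(w, n, cur);
		if (e == NULL)
			return false;	/* owner is runnable, no cycle */
		cur = owner_thread(e->lock_owner);
		if (cur == start)
			return true;
	}
	return false;
}

/* The two threads and owner words quoted in the message above. */
static const struct waiter tstile_sample[] = {
	{ 0xffff800079987800ULL, 0xffff8000d47a3bafULL },
	{ 0xffff8000d47a3ba0ULL, 0xffff80007998780fULL },
};
```

Calling in_wait_cycle(tstile_sample, 2, 0xffff800079987800ULL) follows
ffff800079987800 to ffff8000d47a3ba0 and back again, which is the two-thread
deadlock described in the quoted message.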

