tech-kern archive


Re: Reclaiming vnodes

On Tue, Sep 15, 2009 at 12:54:00AM +0200, Adam Hamsik wrote:
> Hi,
> On Sep,Monday 14 2009, at 5:55 AM, matthew green wrote:
> >
> >
> >i'm still not entirely sure what the point of this patch is.  i
> >understand it helps zfs, but i don't understand why or how. i'm
> >also curious what sort of testing you've done.  i do not believe
> >that testing in qemu is sufficient.  how does it affect systems
> >that recycle vnodes a lot, such as older systems running a build?
> I do not have such a system yet. What this patch does is use another
> thread to reclaim vnodes, so that vnodes are reclaimed in a different
> context than the one in which they are allocated.

You REALLY need to test such a system. If you can't test one, you should 
not commit the change. Find a system and run it hard. Something like a few 
compiles going on and have a few "find" commands running in the 
background (on file systems each with more files than you have max 

> Current:
> Vnodes are allocated only if there are no vnodes on the free_list. If
> there is a free vnode on the list, it is recycled, which means that
> VOP_RECLAIM is called on it.
> In zfs there is a problem with calling getnewvnode from zfs_zget: in
> some cases getnewvnode picks a vnode from the free list and calls
> VOP_RECLAIM. This can lead to a deadlock, because VOP_RECLAIM can try to
> lock the same mutex that is held by zfs_zget. This can't be fixed easily
> unless we touch and change the whole zfs locking protocol.
> With Patch:
> Vnodes are only allocated; there is no vnode recycling. If the number
> of used vnodes in the system is greater than kern.vnodes_num_hiwat (a
> percentage of maxvnodes), the vrele thread is woken up and releases
> free vnodes until the number of used vnodes is less than
> kern.vnodes_num_lowat.
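
For reference, the high/low watermark behaviour described in the quoted text can be sketched roughly as below. This is not the patch itself, only a userland model of it: struct vnode_limits, pct_of() and vrele_pass() are hypothetical names, and the sketch assumes every vnode above the low mark is free and can be reclaimed, which a real system would not guarantee.

```c
#include <assert.h>

/* Hypothetical model of the tunables named in the quoted description. */
struct vnode_limits {
	unsigned maxvnodes;	/* hard limit on vnodes */
	unsigned hiwat_pct;	/* kern.vnodes_num_hiwat, percent of maxvnodes */
	unsigned lowat_pct;	/* kern.vnodes_num_lowat, percent of maxvnodes */
};

/* Convert a percentage of total into an absolute count. */
static unsigned
pct_of(unsigned total, unsigned pct)
{
	return (unsigned)((unsigned long long)total * pct / 100);
}

/*
 * Model one wakeup check of the reclaiming thread.  Returns the number
 * of used vnodes afterwards.  Nothing happens unless the count is above
 * the high mark; once triggered, vnodes are released one at a time
 * until the count is below the low mark.
 */
static unsigned
vrele_pass(const struct vnode_limits *l, unsigned used)
{
	assert(l->lowat_pct <= l->hiwat_pct && l->hiwat_pct <= 100);

	if (used <= pct_of(l->maxvnodes, l->hiwat_pct))
		return used;		/* thread is not woken up */

	unsigned low = pct_of(l->maxvnodes, l->lowat_pct);
	while (used > 0 && used >= low)
		used--;			/* stands in for reclaiming one free vnode */
	return used;
}
```

Note that in this model everything cached between the two marks is thrown away in a single pass, by a single thread.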

The problem with this is you're going to reduce caching just because you 
can't figure out how to fix the locking.

Also, it strikes me that your design doesn't scale well. Right now, each 
thread that wants a vnode becomes a vnode cleaning thread. So if we need 
to reclaim a lot of vnodes at once, all the waiting threads pitch in to do 
work. Without an easy way to dynamically scale the number of reclaiming 
threads, you introduce a resource allocation issue. Pick too few threads, 
and you can choke a busy system. Pick too many, and you waste space usable 
by other subsystems.

My suggestion on how to fix this is:

1) Get a mutex variant that always records lock owner (curlwp()) and a 
mutex_enter() variant that will report EDEADLK when you try to lock a lock 
you have.

2) Use said mutexes in zfs and use said mutex_enter() variant in the 
places that VOP_RECLAIM would hit.

3) Have VOP_RECLAIM report EDEADLK if you'd get a deadlock w/ the 
operation (and clean up appropriately).

4) Have getnewvnode() try the next vnode if it gets EDEADLK.
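
A rough userland sketch of the four steps, to make the idea concrete. It uses a POSIX error-checking mutex as a stand-in for the proposed kernel mutex variant, since pthread_mutex_lock() on a PTHREAD_MUTEX_ERRORCHECK mutex returns EDEADLK when the caller already owns it. Everything else here (struct fake_vnode, owner_mutex_init(), fake_reclaim(), fake_getnewvnode()) is a hypothetical name made up for illustration, not kernel or zfs code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode whose reclaim needs a filesystem lock. */
struct fake_vnode {
	pthread_mutex_t *fs_lock;	/* lock the reclaim path must take */
	int reclaimed;
};

/* Step 1: a mutex that tracks its owner and reports EDEADLK on relock. */
static int
owner_mutex_init(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int error;

	error = pthread_mutexattr_init(&attr);
	if (error != 0)
		return error;
	error = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (error == 0)
		error = pthread_mutex_init(m, &attr);
	pthread_mutexattr_destroy(&attr);
	return error;
}

/* Steps 2 and 3: reclaim fails with EDEADLK instead of self-deadlocking. */
static int
fake_reclaim(struct fake_vnode *vp)
{
	int error = pthread_mutex_lock(vp->fs_lock);

	if (error != 0)
		return error;	/* EDEADLK: this thread already holds fs_lock */
	vp->reclaimed = 1;
	pthread_mutex_unlock(vp->fs_lock);
	return 0;
}

/* Step 4: walk the free list, skipping vnodes that would deadlock. */
static struct fake_vnode *
fake_getnewvnode(struct fake_vnode *freelist, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int error = fake_reclaim(&freelist[i]);

		if (error == 0)
			return &freelist[i];
		assert(error == EDEADLK);	/* only failure expected here */
	}
	return NULL;	/* nothing reclaimable without deadlocking */
}
```

If the caller holds the lock that the first vnode on the list needs, fake_getnewvnode() leaves that vnode alone and returns the next one whose reclaim succeeds.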

I realize you're trying to get zfs working, and all of this is kinda 
piling more work in front of that. However I am concerned that what you've 
proposed will, in the long run, run us into more trouble than we have now.

Take care,


