Subject: Re: Redoing file system suspension API (update)
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
List: tech-kern
Date: 06/21/2006 13:34:38
On Wed, Jun 21, 2006 at 08:02:56PM +0900, YAMAMOTO Takashi wrote:
> > > > > first of all, i tend to think filesystem snapshot thing should be done
> > > > > entirely in filesystem-dependent code.
> > > > 
> > > > Depends on what to expect from suspension.  I expect a file system state
> > > > where system calls are the atomic operations.
> > > 
> > > isn't it almost the same as VOPs?  (with some exceptions, of course)
> > 
> > And how would you explain this to a programmer/user?
> >   A suspended file system is in a state where VOPs are the atomic operations.
> >   Look at the kernel source to see what this might mean for your application.
> > 
> > I think it is a much cleaner way to use system calls as atomic operations.
> > 
> > Doing it inside file systems you may also lose the "no locked vnodes" property.
> 
> we should turn (most part of) vnode lock into filesystem internal as well. :)
> 
> well, i think neither syscalls nor individual VOPs are appropriate
> for your purpose.  what you need is the intermediate.  ie. a set of VOPs.
> 
> for example,
> 
> int
> vn_remove(const char *path)
> {
> 
> 	lookup_parent(..., &dvp, ...);
> 
> 	vngate_enter(dvp->v_mount);
> 	lock(dvp);
> 	lookup_lastcomponent(dvp, &vp, ..);
> 	VOP_REMOVE(dvp, vp, ...);
> 	vngate_leave(dvp->v_mount);
> }

Why do you think "lookup_parent()" does not change file system data/metadata?

What if we make lookup() gate-aware?

	- add struct mount *ni_gate, *ni_dgate to struct nameidata
	- add an option KEEPGATES to namei(): by default namei() leaves the
	  gates before it returns; with KEEPGATES it returns with them held.

and your vn_remove() example becomes

	NDINIT(..., KEEPGATES, ...);
	namei(&nd);
	VOP_LEASE(...);
	...
	VOP_REMOVE(nd.ni_dvp, nd.ni_vp, ...);
	vngate_leave(nd.ni_dvp->v_mount);
	vngate_leave(nd.ni_vp->v_mount);
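
To make the KEEPGATES idea a bit more concrete, here is a rough user-space
sketch.  Everything in it is an assumption for illustration: the structs are
stripped-down stand-ins, not the kernel ones, the gate is only a counter, and
mock_namei() and mock_remove() are hypothetical helpers that take the vnodes
directly instead of resolving a path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the NetBSD kernel structures. */
struct mount { int gate_count; };          /* number of holds on the gate */
struct vnode { struct mount *v_mount; };

#define KEEPGATES 0x1                      /* proposed flag; value is arbitrary */

struct nameidata {
	int ni_flags;
	struct vnode *ni_dvp, *ni_vp;      /* parent directory and target */
	struct mount *ni_gate, *ni_dgate;  /* gates still held after the lookup */
};

static void
vngate_enter(struct mount *mp)
{
	mp->gate_count++;
}

static void
vngate_leave(struct mount *mp)
{
	assert(mp->gate_count > 0);
	mp->gate_count--;
}

/* Mock lookup: the vnodes are passed in instead of being resolved. */
static int
mock_namei(struct nameidata *nd, struct vnode *dvp, struct vnode *vp)
{
	/* The lookup itself may change metadata, so it runs inside the gates. */
	vngate_enter(dvp->v_mount);
	vngate_enter(vp->v_mount);
	nd->ni_dvp = dvp;
	nd->ni_vp = vp;

	if (nd->ni_flags & KEEPGATES) {
		/* Hand the held gates over to the caller. */
		nd->ni_dgate = dvp->v_mount;
		nd->ni_gate = vp->v_mount;
	} else {
		vngate_leave(vp->v_mount);
		vngate_leave(dvp->v_mount);
		nd->ni_gate = nd->ni_dgate = NULL;
	}
	return 0;
}

/* Caller side of the remove example; returns the holds seen at the "VOP". */
static int
mock_remove(struct vnode *dvp, struct vnode *vp)
{
	struct nameidata nd = { 0 };
	int held;

	nd.ni_flags = KEEPGATES;
	mock_namei(&nd, dvp, vp);
	held = nd.ni_dvp->v_mount->gate_count;  /* VOP_REMOVE would run here */
	vngate_leave(nd.ni_dgate);
	vngate_leave(nd.ni_gate);
	return held;
}
```

The point of the sketch is only the bookkeeping: the lookup and the operation
share one uninterrupted hold, and every enter is matched by a leave in either
mode.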

> > > > > i don't think it's desirable for each subsystem to put its own
> > > > > random hooks in these places.
> > > > 
> > > > It is possible to put the suspend/resume around calls to device
> > > > functions (d_open, d_read etc) in spec_vnops, device functions (so_receive,
> > > > so_send etc) in fifo_vnops.c, around ttywait(), selcommon() and pollcommon().
> > > > That is what I did in my first proposal.
> > > 
> > > i don't think this suspend/resume is a good idea at all.
> > 
> > We will need it for a file system external implementation.  We cannot ignore
> > gating for VCHR/VBLK vnodes as they may change meta data.  ffs_specop already
> > does this.  And they might go to long sleep holding a suspension for possibly
> > infinite time.
> 
> i think you can call vngate_leave in eg. ufsspec_read.
> yes, in this case, the caller needs to ensure that it "holds"
> exactly one vngate_enter.  i don't think it's so bad.

Taken.
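
A rough user-space sketch of how I read this, under stated assumptions: the
gate is modelled as a plain hold counter, mock_dev_read() stands in for a
d_read that may sleep for a long time, and re-entering the gate after the
device call is my assumption, not something spelled out above.  All mock_*
names are hypothetical.

```c
#include <assert.h>

/* Hypothetical user-space model; names follow the mail, bodies are stand-ins. */
struct mount { int gate_count; };   /* holds by the current thread */

static void
vngate_enter(struct mount *mp)
{
	mp->gate_count++;
}

static void
vngate_leave(struct mount *mp)
{
	assert(mp->gate_count > 0);
	mp->gate_count--;
}

static int held_during_io = -1;     /* recorded by the mock device read */

/* Stand-in for a device read that may sleep for a long time. */
static int
mock_dev_read(struct mount *mp)
{
	held_during_io = mp->gate_count;
	return 0;
}

/* Model of a ufsspec_read-style wrapper that drops the gate around i/o. */
static int
mock_ufsspec_read(struct mount *mp)
{
	int error;

	/* Caller contract: exactly one hold, so one leave really opens the gate. */
	assert(mp->gate_count == 1);
	vngate_leave(mp);
	error = mock_dev_read(mp);  /* a suspension could proceed during this call */
	vngate_enter(mp);           /* a real gate might block here until resume */
	return error;
}
```

With more than one hold the single vngate_leave() would not release the gate,
which is why the "exactly one" condition matters.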

> > > > > > To solve the rest of 3) it adds a throttling on the first gate not involved
> > > > > > in a suspending file system.
> > > > > 
> > > > > - isn't it normal that an operation becomes slow when the system has
> > > > >   other activities?
> > > > 
> > > > Slow, yes. But in case of suspension the sync-to-disk becomes very slow.
> > > > Throttling other i/o reduces the time to suspension from > 5 minutes
> > > > to < 30 seconds on my test machine.
> > > 
> > > - is it true even if filesystems are backed by different disks?
> > 
> > Yes.  My test machine has root on sd0 and test1..4 on sd1.  It is true for
> > the case where the load is on root and the suspension is on test1.  With
> > softdep of course.  Main problem is the softdep code is not per-mount.
> > 
> > > - why does it need the special care?
> > 
> > It solves a real problem now that may go away with updates to the softdep code
> > or the introduction of a real i/o scheduler.
> 
> it isn't clear to me why the suspension on filesystem A has a priority over
> activities on unrelated filesystem B.

Try it for yourself (on one disk if you need real problems)....

> > > > > please try to avoid putting subsystem-specific data to struct lwp.
> > > > 
> > > > If we use permanent gates we have per-thread state.  Where should this state go
> > > > if not into struct lwp?
> > > 
> > > i meant permanent gate is a bad idea.
> > 
> > Non-permanent gates have the same problem.  We must take care of long sleeps.
> 
> can you explain?
> 
> i thought
> 
> 	vngate_enter(PERMANENT);
> 	some_operations();
> 
> 	long_sleep(); /* with suspend/resume */
> 
> 	other_operations();
> 	vngate_leave_all();
> 
> could be
> 
> 	vngate_enter();
> 	some_operations();
> 	vngate_leave();
> 
> 	long_sleep(); /* without suspend/resume */
> 
> 	vngate_enter();
> 	other_operations();
> 	vngate_leave();

At least for specfs/fifofs this looks ok.

> YAMAMOTO Takashi

-- 
Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)