Subject: Re: Redoing file system suspension API (update)
To: None <hannken@eis.cs.tu-bs.de>
From: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
List: tech-kern
Date: 06/21/2006 20:02:56
> > > > first of all, i tend to think filesystem snapshot thing should be done
> > > > entirely in filesystem-dependent code.
> > > 
> > > Depends on what to expect from suspension.  I expect a file system state
> > > where system calls are the atomic operations.
> > 
> > isn't it almost the same as VOPs?  (with some exceptions, of course)
> 
> And how would you explain this to a programmer/user?
>   A suspended file system is in a state where VOPs are the atomic operations.
>   Look at the kernel source what this might mean for your application.
> 
> I think it is a much cleaner way to use system calls as atomic operations.
> 
> Doing it inside file systems you may also lose the "no locked vnodes" property.

we should make (most of) the vnode locking filesystem-internal as well. :)

well, i think neither syscalls nor individual VOPs are appropriate
for your purpose.  what you need is something in between, ie. a set of VOPs.

for example,

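int
vn_remove(const char *path)
{
	int error;

	lookup_parent(..., &dvp, ...);

	/* the gate covers the whole set of VOPs, not each VOP separately */
	vngate_enter(dvp->v_mount);
	lock(dvp);
	lookup_lastcomponent(dvp, &vp, ..);
	error = VOP_REMOVE(dvp, vp, ...);
	vngate_leave(dvp->v_mount);

	return error;
}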
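
to make the enter/leave idea concrete, here is a minimal userland sketch of
such a gate built on pthreads.  it is only an illustration under assumptions:
struct vngate, vngate_init, vngate_suspend and vngate_resume are hypothetical
names, and vngate_enter/vngate_leave take the gate itself rather than a mount.
none of this is the actual kernel interface.

```c
#include <pthread.h>
#include <stdbool.h>

/* hypothetical userland stand-in for a per-mount gate */
struct vngate {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	int active;		/* operations currently inside the gate */
	bool suspending;	/* true while a suspension is requested or held */
};

static void
vngate_init(struct vngate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->cv, NULL);
	g->active = 0;
	g->suspending = false;
}

/* enter the gate; blocks while a suspension is requested or held */
static void
vngate_enter(struct vngate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->suspending)
		pthread_cond_wait(&g->cv, &g->lock);
	g->active++;
	pthread_mutex_unlock(&g->lock);
}

/* leave the gate; wakes a waiting suspender when the last one leaves */
static void
vngate_leave(struct vngate *g)
{
	pthread_mutex_lock(&g->lock);
	if (--g->active == 0)
		pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->lock);
}

/* stop new entries, then wait until everyone inside has left */
static void
vngate_suspend(struct vngate *g)
{
	pthread_mutex_lock(&g->lock);
	while (g->suspending)		/* one suspender at a time */
		pthread_cond_wait(&g->cv, &g->lock);
	g->suspending = true;
	while (g->active > 0)
		pthread_cond_wait(&g->cv, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

/* reopen the gate and wake everyone blocked in vngate_enter */
static void
vngate_resume(struct vngate *g)
{
	pthread_mutex_lock(&g->lock);
	g->suspending = false;
	pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->lock);
}
```

with a gate like this, the suspended state is defined by what sits between
enter and leave, so the granularity is chosen by where the pair is placed.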

> > > > i don't think it's desirable for each subsystem to put its own
> > > > random hooks in these places.
> > > 
> > > It is possible to put the suspend/resume around calls to device
> > > functions (d_open, d_read etc) in spec_vnops, device functions (so_receive,
> > > so_send etc) in fifo_vnops.c, around ttywait(), selcommon() and pollcommon().
> > > That is what I did in my first proposal.
> > 
> > i don't think this suspend/resume is a good idea at all.
> 
> We will need it for a file system external implementation.  We cannot ignore
> gating for VCHR/VBLK vnodes as they may change meta data.  ffs_specop already
> does this.  And they might go to long sleep holding a suspension for possibly
> infinite time.

i think you can call vngate_leave in eg. ufsspec_read.
yes, in this case, the caller needs to ensure that it "holds"
exactly one vngate_enter.  i don't think it's so bad.
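
a rough single-threaded model of that invariant, as a sketch only:
gate_holds, device_read and specfs_read are hypothetical stand-ins (the last
one for something like ufsspec_read), and re-entering the gate after the
blocking call is an assumption about what the caller would expect.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical model: number of gate holds of the current thread */
static int gate_holds;
static bool gate_held_while_blocked;

static void vngate_enter(void) { gate_holds++; }
static void vngate_leave(void) { gate_holds--; }

/* stands in for a device read that may block for a long time */
static int
device_read(void)
{
	/* record whether a suspender would have been kept waiting */
	gate_held_while_blocked = (gate_holds > 0);
	return 0;
}

/* hypothetical analogue of ufsspec_read, called with the gate held */
static int
specfs_read(void)
{
	int error;

	/*
	 * the caller must hold exactly one enter; with more than one,
	 * the single leave below would not really open the gate.
	 */
	assert(gate_holds == 1);

	vngate_leave();
	error = device_read();
	vngate_enter();		/* assumed: restore the caller's hold */

	return error;
}
```

the assertion is the whole cost of the scheme: nested enters have to be
avoided on every path that can reach the blocking call.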

> > > > > To solve the rest of 3) it adds a throttling on the first gate not involved
> > > > > in a suspending file system.
> > > > 
> > > > - isn't it normal that an operation becomes slow when the system has
> > > >   other activities?
> > > 
> > > Slow, yes. But in case of suspension the sync-to-disk becomes very slow.
> > > Throttling other i/o reduces the time to suspension from > 5 minutes
> > > to < 30 seconds on my test machine.
> > 
> > - is it true even if filesystems are backed by different disks?
> 
> Yes.  My test machine has root on sd0 test1..4 on sd1.  It is true for
> the case where the load is on root and the suspension is on test1.  With
> softdep of course.  Main problem is the softdep code is not per-mount.
> 
> > - why does it need the special care?
> 
> It solves a real problem now that may go away with updates to the softdep code
> or the introduction of a real i/o scheduler.

it isn't clear to me why the suspension of filesystem A should have priority
over activities on an unrelated filesystem B.

> > > > please try to avoid putting subsystem-specific data to struct lwp.
> > > 
> > > If we use permanent gates we have per-thread state.  Where should this state go
> > > if not into struct lwp?
> > 
> > i meant permanent gate is a bad idea.
> 
> Non-permanent gates have the same problem.  We must take care of long sleeps.

can you explain?

i thought

	vngate_enter(PERMANENT);
	some_operations();

	long_sleep(); /* with suspend/resume */

	other_operations();
	vngate_leave_all();

could be

	vngate_enter();
	some_operations();
	vngate_leave();

	long_sleep(); /* without suspend/resume */

	vngate_enter();
	other_operations();
	vngate_leave();
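
the second form can be checked with a small single-threaded model.  this is
a sketch under assumptions: gate_holders and can_suspend are hypothetical,
the operations are empty placeholders, and the gate is reduced to a counter.

```c
#include <stdbool.h>

/* hypothetical model: the gate is just a count of holders */
static int gate_holders;
static bool suspendable_during_sleep;

static void vngate_enter(void) { gate_holders++; }
static void vngate_leave(void) { gate_holders--; }

/* a suspension can proceed only when nobody is inside the gate */
static bool can_suspend(void) { return gate_holders == 0; }

static void some_operations(void) { }
static void other_operations(void) { }

/* stands in for eg. a tty or socket wait */
static void
long_sleep(void)
{
	/* record whether a suspension could complete while we sleep */
	suspendable_during_sleep = can_suspend();
}

static void
operation(void)
{
	vngate_enter();
	some_operations();
	vngate_leave();

	long_sleep();	/* no gate held here, so no suspend/resume needed */

	vngate_enter();
	other_operations();
	vngate_leave();
}
```

since nothing is held across the sleep, there is no per-thread gate state to
keep and nothing to suspend or resume around it.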

YAMAMOTO Takashi