tech-kern: Re: Redoing file system suspension API (update)

Subject: Re: Redoing file system suspension API (update)
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
List: tech-kern
Date: 06/18/2006 20:48:07
On Sun, Jun 18, 2006 at 10:59:33PM +0900, YAMAMOTO Takashi wrote:
> thanks for taking a look of this.
> although i have many negative comments, i think it's a step forward. :-)
> 
> 
> first of all, i tend to think filesystem snapshot thing should be done
> entirely in filesystem-dependent code.

Depends on what to expect from suspension.  I expect a file system state
where system calls are the atomic operations.

> > - Instead of tracking possible long sleeps I put the suspend/resume into
> >   ltsleep().
> 
> why?

I want to suspend the gates before a thread goes to long sleep.  Ltsleep
is the point where it happens.

> i don't think it's desirable for each subsystems to put their own
> random hooks in these places.

It is possible to put the suspend/resume around calls to device
functions (d_open, d_read etc) in spec_vnops, device functions (so_receive,
so_send etc) in fifo_vnops.c, around ttywait(), selcommon() and pollcommon().
That is what I did in my first proposal.

> what happens if a filesystem itself sleeps with PCATCH?
> (maybe you can call it a bug, but we currently have such a code.)

Yes, it is a bug.  Which file system btw?

> besides, vngate_resume itself seems to sleep.
> is it safe to call it from ltsleep?

Yes, as long as we only sleep with timeout 0 and no PCATCH.

> > To solve the rest of 3) it adds a throttling on the first gate not involved
> > in a suspending file system.
> 
> - isn't it normal that an operation become slow when the system has
>   other activities?

Slow, yes. But in case of suspension the sync-to-disk becomes very slow.
Throttling other i/o reduces the time to suspension from > 5 minutes
to < 30 seconds on my test machine.

> - why you check P_SYSTEM?

I don't see the above problem (high i/o load) for any system process yet.

> > ** The new API is:
> > 
> > Using explicit enter()/leave() pairs adds much complexity so I took another
> > approach. I use two types of gates.  Normal gates need a "leave" operation.
> > Permanent gates are valid until the thread returns to user mode.
> 
> while it can make your patch smaller, i think it's actually more complex
> and harder to understand and maintain.

Where is the complexity and maintenance?  Threads either return to user through
userret() or we explicitly say vngate_leave_all().  It reduces complexity
because we dont need to track every gate.  Using only normal gates lookup/namei 
had to return two gates we have to take care of.  Doing exec_XXX makes it even
more complex to handle.  But of course it is doable.

> > void vngate_initialize(struct lwp *l)
> > 
> >   Initialize the per-thread state.  Will be freed on vngate_leave_all(..., 1).
> 
> please try to avoid putting subsystem-specific data to struct lwp.

If we use permanent gates we have per-thread state.  Where should this state go
if not into struct lwp?

> >   V_NOERROR	Panic on error.  No need for the caller to check the result.
> 
> what's the point of this?

I like style where results are not silently ignored.  Any usage of vngate_enter
without V_NOERROR and ignoring the result is a coding error.

> > Adding permanent gates to lookup(), (new) FILE_USE_GATED() and nfsrv_fhtovp()
> > covers most system calls. The various compat parts are not ready.  A scan over
> > all "FILE_USE()" should be sufficient here.  The implementation renames
> > "vfs_write_XXX" to "vfs_XXX" and adds the state pointer to "struct lwp".
> > All other changes depend on "option NEWVNGATE" to make it easy to test the
> > new API.
> 
> why you put vngate_enter into FILE_USE, rather than VOPs?

I hope to understand right.  You meant

	FILE_USE(...);
	if ((fp)->f_type == DTYPE_VNODE)
		vngate_enter(...)

instead of

	FILE_USE_GATED(...)

It is easier to read.  We could add a flag to FILE_USE() instead.

If you meant putting the gates inside the VOP_XXX functions, this cannot work.
Some VOPs need to be called with simple locks so we cannot sleep here.

> YAMAMOTO Takashi

-- 
Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)