Subject: Re: Redoing file system suspension API (update)
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
From: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
List: tech-kern
Date: 06/20/2006 12:03:17
On Tue, Jun 20, 2006 at 06:22:12PM +0900, YAMAMOTO Takashi wrote:
> > > first of all, i tend to think filesystem snapshot thing should be done
> > > entirely in filesystem-dependent code.
> > 
> > Depends on what to expect from suspension.  I expect a file system state
> > where system calls are the atomic operations.
> 
> isn't it almost the same as VOPs?  (with some exceptions, of course)

And how would you explain this to a programmer/user?
  A suspended file system is in a state where VOPs are the atomic operations.
  Look at the kernel source to see what this might mean for your application.

I think using system calls as the atomic operations is much cleaner.

If you do it inside the file systems, you may also lose the "no locked
vnodes" property.

> > > i don't think it's desirable for each subsystems to put their own
> > > random hooks in these places.
> > 
> > It is possible to put the suspend/resume around calls to device
> > functions (d_open, d_read etc) in spec_vnops, device functions (so_receive,
> > so_send etc) in fifo_vnops.c, around ttywait(), selcommon() and pollcommon().
> > That is what I did in my first proposal.
> 
> i don't think this suspend/resume is a good idea at all.

We will need it for an implementation external to the file systems.  We cannot
ignore gating for VCHR/VBLK vnodes as they may change metadata.  ffs_specop
already does this.  And they might go into a long sleep, holding up a
suspension for a possibly infinite time.

> > > what happens if a filesystem itself sleeps with PCATCH?
> > > (maybe you can call it a bug, but we currently have such a code.)
> > 
> > Yes, it is a bug.  Which file system btw?
> 
> lfs and nfs, at least.  you can grep. :-)
> 
> > > > To solve the rest of 3) it adds a throttling on the first gate not involved
> > > > in a suspending file system.
> > > 
> > > - isn't it normal that an operation become slow when the system has
> > >   other activities?
> > 
> > Slow, yes. But in case of suspension the sync-to-disk becomes very slow.
> > Throttling other i/o reduces the time to suspension from > 5 minutes
> > to < 30 seconds on my test machine.
> 
> - is it true even if filesystems are backed by different disks?

Yes.  My test machine has root on sd0 and test1..4 on sd1.  It is true for
the case where the load is on root and the suspension is on test1.  With
softdep, of course.  The main problem is that the softdep code is not
per-mount.

> - why does it need the special care?

It solves a problem that is real now but may go away with updates to the
softdep code or the introduction of a real i/o scheduler.

> > > - why you check P_SYSTEM?
> > 
> > I don't see the above problem (high i/o load) for any system process yet.
> 
> checking P_SYSTEM is not an appropriate way to see if a process can involve
> high i/o load.
> 
> eg.
> 	- dmover software-backend.
> 	- we might make nfsd a real kernel thread at some point.

I am fine with removing this test.

> > > > ** The new API is:
> > > > 
> > > > Using explicit enter()/leave() pairs adds much complexity so I took another
> > > > approach. I use two types of gates.  Normal gates need a "leave" operation.
> > > > Permanent gates are valid until the thread returns to user mode.
> > > 
> > > while it can make your patch smaller, i think it's actually more complex
> > > and harder to understand and maintain.
> > 
> > Where is the complexity and maintenance?
> 
> it introduces one more thing which should be considered whenever you do
> lwp-switch.  it seems complex to me.
> i can't believe putting vfs code into ltsleep is a good idea.

This is the point in execution where per-thread state is updated.  I see
nothing wrong here.

> > > please try to avoid putting subsystem-specific data to struct lwp.
> > 
> > If we use permanent gates we have per-thread state.  Where should this state go
> > if not into struct lwp?
> 
> i meant permanent gate is a bad idea.

Non-permanent gates have the same problem.  We must take care of long sleeps.
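
To illustrate the difference between the two gate types, here is a toy,
single-threaded model.  It is not the code from the patch: every name in it
is made up, "enter" fails instead of sleeping, and it assumes a permanent
gate is dropped both on return to user mode and before a long sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one gate per mounted file system. */
struct toy_gate {
	int  active;     /* threads currently inside the gate */
	bool suspended;  /* true while the file system is suspended */
};

/* Per-thread state; in the kernel this would live in struct lwp. */
struct toy_thread {
	struct toy_gate *permanent;  /* permanent gate held, or NULL */
};

/* Normal gate: the caller must pair this with toy_gate_leave(). */
static int
toy_gate_enter(struct toy_gate *g)
{
	if (g->suspended)
		return -1;  /* real code would sleep until resume */
	g->active++;
	return 0;
}

static void
toy_gate_leave(struct toy_gate *g)
{
	assert(g->active > 0);
	g->active--;
}

/* Permanent gate: remembered in the thread, no explicit leave. */
static int
toy_gate_enter_permanent(struct toy_thread *t, struct toy_gate *g)
{
	if (t->permanent != NULL)
		return (t->permanent == g) ? 0 : -1;  /* one gate per thread here */
	if (toy_gate_enter(g) != 0)
		return -1;
	t->permanent = g;
	return 0;
}

/* Called on return to user mode and (assumed) before a long sleep. */
static void
toy_thread_drop_permanent(struct toy_thread *t)
{
	if (t->permanent != NULL) {
		toy_gate_leave(t->permanent);
		t->permanent = NULL;
	}
}

/* Suspension succeeds only once no thread is inside the gate. */
static bool
toy_suspend(struct toy_gate *g)
{
	if (g->active > 0)
		return false;
	g->suspended = true;
	return true;
}
```

In this model a thread that sleeps without dropping its gate, permanent or
not, keeps toy_suspend() failing for as long as it sleeps.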

> > > >   V_NOERROR	Panic on error.  No need for the caller to check the result.
> > > 
> > > what's the point of this?
> > 
> > I like style where results are not silently ignored.  Any usage of vngate_enter
> > without V_NOERROR and ignoring the result is a coding error.
> 
> V_NOERROR in your patch mean that you are sure no error happens in
> these places?

Yes.
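
A minimal sketch of the convention, with made-up names and a made-up flag
value (DEMO_V_NOERROR stands in for V_NOERROR, abort() stands in for a
panic, and fail_with only exists to inject an error):

```c
#include <assert.h>
#include <errno.h>   /* error numbers such as EINTR for callers */
#include <stdio.h>
#include <stdlib.h>

#define DEMO_V_NOERROR 0x01  /* hypothetical value */

struct demo_gate {
	int active;     /* threads inside the gate */
	int fail_with;  /* error to inject, 0 for success */
};

/*
 * Without DEMO_V_NOERROR the caller must check the result.
 * With it, any error is fatal, so a return always means success.
 */
static int
demo_gate_enter(struct demo_gate *g, int flags)
{
	int error = g->fail_with;

	if (error != 0) {
		if (flags & DEMO_V_NOERROR) {
			fprintf(stderr, "demo_gate_enter: error %d\n", error);
			abort();
		}
		return error;
	}
	g->active++;
	return 0;
}

/* Caller that can see an error: it checks and propagates it. */
static int
demo_checked_caller(struct demo_gate *g)
{
	int error = demo_gate_enter(g, 0);

	if (error != 0)
		return error;
	assert(g->active > 0);
	g->active--;
	return 0;
}

/* Caller that is sure no error can happen: no result to check. */
static void
demo_unchecked_caller(struct demo_gate *g)
{
	(void)demo_gate_enter(g, DEMO_V_NOERROR);
	assert(g->active > 0);
	g->active--;
}
```

The third possible caller, one that passes no flag and drops the result, is
the coding error.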

> > > why you put vngate_enter into FILE_USE, rather than VOPs?
> 
> > If you meant putting the gates inside the VOP_XXX functions, this cannot work.
> 
> i meant this.
> 
> > Some VOPs need to be called with simple locks so we cannot sleep here.
> 
> do you mean getpages/putpages?
> you can deal with them differently.

VOP_LOCK, VOP_GETPAGES, VOP_PUTPAGES at least.  Dealing differently means
"going up" until it is safe to sleep.  For VOP_LOCK at least this is a
nightmare.
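
The shape of the problem, as a toy model with made-up names (no real locks
or sleeping, just a counter of simple locks held by the thread):

```c
#include <assert.h>
#include <stdbool.h>

struct sk_thread { int simple_locks; };  /* simple locks currently held */
struct sk_gate   { int active; };        /* threads inside the gate */

/* Entering a gate may sleep, so it is illegal with a simple lock held. */
static bool
sk_gate_enter(struct sk_thread *t, struct sk_gate *g)
{
	if (t->simple_locks > 0)
		return false;  /* a kernel could not sleep here */
	g->active++;
	return true;
}

static void
sk_gate_leave(struct sk_gate *g)
{
	assert(g->active > 0);
	g->active--;
}

/* The operation itself; its contract is "called with a simple lock". */
static void
sk_op(struct sk_thread *t)
{
	assert(t->simple_locks > 0);
}

/* Gate inside the wrapper: too late, the lock is already held. */
static bool
sk_op_gated_inside(struct sk_thread *t, struct sk_gate *g)
{
	if (!sk_gate_enter(t, g))
		return false;
	sk_op(t);
	sk_gate_leave(g);
	return true;
}

/* Gate hoisted into the caller: enter first, then take the lock. */
static bool
sk_caller_gated_outside(struct sk_thread *t, struct sk_gate *g)
{
	if (!sk_gate_enter(t, g))
		return false;
	t->simple_locks++;
	sk_op(t);
	t->simple_locks--;
	sk_gate_leave(g);
	return true;
}
```

The second form works, but it has to be repeated in every caller, and each
caller has to find its own point where sleeping is still safe.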

> 
> YAMAMOTO Takashi

-- 
Juergen Hannken-Illjes - hannken@eis.cs.tu-bs.de - TU Braunschweig (Germany)