Subject: Re: nfs servers and 5 minute VOP_READ's
To: Bill Sommerfeld <sommerfeld@orchard.arlington.ma.us>
From: Bill Studenmund <wrstuden@nas.nasa.gov>
List: tech-kern
Date: 03/15/1999 14:51:10
On Sat, 13 Mar 1999, Bill Sommerfeld wrote:

> Yup.  You're hosed without a bunch of rework.  The other nfsds
> probably pile up on a vn_lock() on the vnode being read/written to
> deep in the guts of some fhtovp routine.

Yep. It's in the call the lower fs's fhtovp() routine makes to its
VFS_VGET() routine.
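
To make the pile-up concrete, here's a rough userland model of it: a
handful of pthreads stand in for the nfsd threads, and a plain mutex
stands in for the vnode lock. None of the names below are the real
kernel interfaces; it's just the shape of the problem.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t vnode_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
nfsd_thread(void *arg)
{
    int id = *(int *)arg;

    if (id != 0)
        usleep(100000);     /* let thread 0 grab the lock first */

    /* "fhtovp": every request for this file contends here. */
    printf("nfsd %d: waiting for the vnode lock\n", id);
    pthread_mutex_lock(&vnode_lock);
    printf("nfsd %d: got the lock\n", id);
    if (id == 0)
        sleep(5);           /* thread 0 models the slow restore */
    pthread_mutex_unlock(&vnode_lock);
    return NULL;
}

int
main(void)
{
    pthread_t tid[4];
    int id[4], i;

    for (i = 0; i < 4; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, nfsd_thread, &id[i]);
    }
    for (i = 0; i < 4; i++)
        pthread_join(tid[i], NULL);
    return 0;
}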

> It's my understanding that HP had to do a lot of work on the HP-UX NFS
> server to get it to play nice with a magneto-optic jukebox they were
> hawking a few years ago which took maybe 10-20 seconds to do a media
> change.  (I'm not really sure exactly what they did, though).

One big problem there would be if the clients were trying to use multiple
media in the jukebox at the same time. You could get them tugging on each
other, with 10-20 second penalties. In the case here, there is no
competition between users - restoring one file won't cause another in-use
file to go offline the way removing media would. :-)

> This is a fundamental problem with RPC-based services which tie up a
> thread for the duration of the service routine.  There are various
> ways around this..
> 
>  - create more worker threads when they're all busy.
> 
> Included only because it's the obvious answer to some, and is Just
> Wrong -- stacks are too expensive to allocate on demand..

I think it's wrong here also. :-)

>  - thread pools/intelligent allocation/queueing of requests to threads
> 
> This would involve adding some intelligent pre-dispatching into the
> NFS server, which looks at enough of the request to sort it into the
> right queue based on what filehandle/file system/etc is involved.
> 
> Certain operations could conceivably be answered directly at netisr
> level; others could be sorted into "fast" and "slow" queues and
> handled by separate pools of threads.
> 
> We already have part of the infrastructure for this in our NFS server;
> see nfsrv_getstream() and nfsrv_wakenfsd() in sys/nfs/nfs_socket.c, as
> well as nfsrv_writegather() in nfs_serv.c
> 
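
To flesh that out a bit, here's roughly what I'd picture the
pre-dispatch step looking like. It's only a sketch; the request and
queue structures and the fh_is_slow() predicate are invented for
illustration, not anything in the tree.

#include <sys/queue.h>

struct nfs_request {
    unsigned char            fh[32];    /* file handle from the RPC */
    TAILQ_ENTRY(nfs_request) link;
};

TAILQ_HEAD(reqqueue, nfs_request);

static struct reqqueue fastq = TAILQ_HEAD_INITIALIZER(fastq);
static struct reqqueue slowq = TAILQ_HEAD_INITIALIZER(slowq);

/*
 * Hypothetical predicate: does this file handle refer to something
 * that may take a long time to service (e.g. a migrated file)?
 */
static int
fh_is_slow(const unsigned char *fh)
{
    (void)fh;
    return 0;   /* stub; the real test would ask the file system */
}

/*
 * Pre-dispatch: look at just enough of the request to pick a queue.
 * Requests that may block for minutes go on the slow queue, so they
 * only ever tie up the slow pool's threads.
 */
static void
nfsrv_predispatch(struct nfs_request *req)
{
    if (fh_is_slow(req->fh))
        TAILQ_INSERT_TAIL(&slowq, req, link);
    else
        TAILQ_INSERT_TAIL(&fastq, req, link);
    /* ...then wake an idle thread from the matching pool. */
}
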
>  - Then there's the loony rewrite-the-world approach.. rewriting large
> chunks of the filesystem to be asynchronous-- what Scheme and ML
> hackers call "continuation-passing style".  In short, routines which
> can block are changed so that instead of returning a value, they
> instead get a continuation routine (and cookie pointer, since C
> doesn't support closures). When the operation completes, instead of
> returning a value, the continuation routine is called.
> 
> We already support this style at various parts of the system.. for
> instance, the interrupt side of drivers is written this way, and
> `struct buf' includes a `biodone' routine which gets called when the
> I/O completes.  timeout() also works this way.
> 
> Bringing this style of coding up out of interrupt level into the
> filesystem would be painful.. and I suspect everyone would hate you
> for it.  Also, absent compiler hacks to do tail-call elimination, it
> would probably significantly increase stack usage in the cases where
> you don't actually have to wait.  Debugging would get more painful
> since you'd have to manually trace back where a request came from
> through multiple layers of continuation routines and cookies.
> 
> However, it would expose a lot more I/O parallelism, which the
> performance guys would really like..
> 
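
For what it's worth, the continuation-passing shape in C comes out
looking something like the toy below: a read routine that is handed a
continuation function and a cookie instead of returning a value, the
same shape as biodone() on a struct buf. Everything here is invented
for illustration; none of it is real filesystem code.

#include <stdio.h>

/* The continuation: called with the caller's cookie and the result. */
typedef void (*read_cont_t)(void *cookie, int error, size_t resid);

static read_cont_t  pending_cont;   /* one parked slow read */
static void        *pending_cookie;

/*
 * Start a read.  If the data is resident, finish immediately by
 * calling the continuation; otherwise park the continuation and
 * return, leaving the calling thread free to do other work.
 */
static void
async_read(int resident, read_cont_t cont, void *cookie)
{
    if (resident) {
        (*cont)(cookie, 0, 0);      /* fast path: done right now */
        return;
    }
    pending_cont = cont;            /* slow path: park it */
    pending_cookie = cookie;
}

/* Called later (say, from interrupt level) when the data shows up. */
static void
read_done(int error, size_t resid)
{
    (*pending_cont)(pending_cookie, error, resid);
}

/* The continuation itself: finish building and send the NFS reply. */
static void
nfs_read_cont(void *cookie, int error, size_t resid)
{
    printf("request %p done, error %d, resid %zu\n",
        cookie, error, resid);
}

int
main(void)
{
    async_read(1, nfs_read_cont, (void *)1);    /* resident file */
    async_read(0, nfs_read_cont, (void *)2);    /* migrated: parked */
    read_done(0, 0);                            /* "restore" finishes */
    return 0;
}

The pain you describe is exactly that every routine between the nfsd
dispatch loop and the driver would have to be turned inside-out like
this.
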
> > Would having the nfs server execute all its reads & writes with IO_NDELAY
> > (and teaching my stuff to return EWOULDBLOCK) make sense?
> 
> No, I think it's harder than that.  What does the NFS server do when
> it gets an EWOULDBLOCK?  do you put the request back in a queue to try
> again later?  (ewww, polling).  It makes more sense to put the
> in-progress request aside somewhere until you get some indication that
> it completed.

Thinking about it more, and looking at the code, the current solution is
that returning EWOULDBLOCK causes the request to be dropped. Then the
client times out, and tries again. If enough data has been restored, a
read might now proceed. Else back comes an EWOULDBLOCK.
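
In code terms, the server side comes out roughly as the sketch below.
The names are made up, and it's only the shape of the logic, not the
actual nfsd code:

#include <errno.h>

struct request {
    int fh;     /* stand-in for the file handle etc. */
};

/* Stub; the real thing would do the VOP_READ with IO_NDELAY set. */
static int
fs_read(struct request *req)
{
    (void)req;
    return EWOULDBLOCK;
}

/* Stub; the real thing would build and transmit the RPC reply. */
static void
send_reply(struct request *req, int error)
{
    (void)req;
    (void)error;
}

static void
serve_read(struct request *req)
{
    int error;

    error = fs_read(req);   /* also kicks off the restore */
    if (error == EWOULDBLOCK) {
        /*
         * Drop the request: no reply at all.  The client's RPC
         * layer times out and retransmits, and by then the restore
         * may have brought back enough data for the read to go
         * through.
         */
        return;
    }
    send_reply(req, error);
}

int
main(void)
{
    struct request req = { 1 };

    serve_read(&req);
    return 0;
}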

> What about when VOP_LOOKUP, or VOP_GETATTR block because directories
> or the metadata were also migrated out to cold storage?

In this scheme, directories aren't shipped off, and the metadata is stored
in the inodes. Thus it's always available.

> And how does an NFS client tell the difference between a server which
> is off the air, and a server which is frantically hunting for the
> right tape?  You might be able to do something using NQNFS leases here
> so that clients which support NQNFS are in a little better shape...

Not sure. Is there a way that the server can say, "I got your request,
but I'm too busy now, try again in a little bit"?

That way all of the waiting is done on the client.

Take care,

Bill