Subject: Re: nfs servers and 5 minute VOP_READ's
To: Bill Studenmund <>
From: Bill Sommerfeld <>
List: tech-kern
Date: 03/13/1999 10:30:16

> Say I have a file for which a VOP_READ will take five minutes to complete
> (I have to restore some of its contents from tape).
> Say this file lives in an nfs-exported fs.
> How badly would this break an nfs server? The first nfsd to read the file
> would sleep for five minutes. What would happen then? Would the client
> retry, and possibly send another process to sleep (possibly getting all of
> them)?

Yup.  You're hosed without a bunch of rework.  The other nfsds
probably pile up in a vn_lock(), deep in the guts of some fhtovp
routine, on the vnode being read or written.

It's my understanding that HP had to do a lot of work on the HP-UX NFS
server to get it to play nice with a magneto-optic jukebox they were
hawking a few years ago which took maybe 10-20 seconds to do a media
change.  (I'm not really sure exactly what they did, though).

This is a fundamental problem with RPC-based services which tie up a
thread for the duration of the service routine.  There are various
ways around this..

 - create more worker threads when they're all busy.

Included only because it's the obvious answer to some, and is Just
Wrong -- stacks are too expensive to allocate on demand..

 - thread pools/intelligent allocation/queueing of requests to threads

This would involve adding some intelligent pre-dispatching into the
NFS server, which looks at enough of the request to sort it into the
right queue based on what filehandle/file system/etc is involved.

Certain operations could conceivably be answered directly at netisr
level; others could be sorted into "fast" and "slow" queues and
handled by separate pools of threads.

We already have part of the infrastructure for this in our NFS server;
see nfsrv_getstream() and nfsrv_wakenfsd() in sys/nfs/nfs_socket.c, as
well as nfsrv_writegather() in nfs_serv.c.
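
To make the "fast" and "slow" queue idea concrete, here is a rough
sketch in plain userland C.  Nothing in it is taken from sys/nfs: the
struct names, predispatch(), and the is_offline hook (which stands in
for "ask the filesystem whether this handle needs a tape fetch") are
all made up for illustration, and the demo predicate is a stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch; none of these names exist in the NFS server. */

#define FH_SIZE 32			/* size of an NFSv2 file handle */

enum req_class { REQ_FAST, REQ_SLOW };

struct request {
	unsigned char fh[FH_SIZE];	/* file handle taken from the RPC */
	struct request *next;
};

struct req_queue {
	struct request *head, *tail;
	size_t len;
};

/* Assumed hook: nonzero if serving this handle means going to tape. */
typedef int (*is_offline_fn)(const unsigned char *fh);

/* Append a request to the tail of a queue. */
static void
enqueue(struct req_queue *q, struct request *r)
{
	r->next = NULL;
	if (q->tail != NULL)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
	q->len++;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
static struct request *
dequeue(struct req_queue *q)
{
	struct request *r = q->head;

	if (r == NULL)
		return NULL;
	q->head = r->next;
	if (q->head == NULL)
		q->tail = NULL;
	q->len--;
	return r;
}

/*
 * Look at just enough of the request (here, only the file handle) to
 * pick a queue.  Each queue would be drained by its own pool of
 * worker threads, so slow requests cannot use up the fast workers.
 */
static enum req_class
predispatch(struct req_queue *fastq, struct req_queue *slowq,
    struct request *r, is_offline_fn is_offline)
{
	assert(r != NULL);
	if (is_offline(r->fh)) {
		enqueue(slowq, r);
		return REQ_SLOW;
	}
	enqueue(fastq, r);
	return REQ_FAST;
}

/* Stand-in predicate for the demo: first handle byte 0xff = offline. */
static int
demo_is_offline(const unsigned char *fh)
{
	return fh[0] == 0xff;
}
```

The point of the split is only that a request stuck behind a tape
restore ties up a slow-pool worker; requests for files that are
already on disk keep flowing through the fast pool.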

 - Then there's the loony rewrite-the-world approach.. rewriting large
chunks of the filesystem to be asynchronous-- what Scheme and ML
hackers call "continuation-passing style".  In short, routines which
can block are changed so that they take a continuation routine (and a
cookie pointer, since C doesn't support closures) as an extra
argument.  When the operation completes, instead of returning a value
to its caller, it calls the continuation routine with the result.

We already support this style at various parts of the system.. for
instance, the interrupt side of drivers is written this way, and
`struct buf' includes a `biodone' routine which gets called when the
I/O completes.  timeout() also works this way.
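
For anyone who hasn't seen the continuation-plus-cookie idiom in C,
here's a minimal userland sketch.  It is not kernel code: read_async(),
read_done(), struct pending_read and send_reply() are hypothetical
names, and the "cached" flag just stands in for "the data is already
on disk, no tape needed".

```c
#include <assert.h>
#include <stddef.h>

/* Continuation: gets the caller's cookie plus the result of the read. */
typedef void (*cont_fn)(void *cookie, int error, size_t nread);

/* State kept for a read that could not complete immediately. */
struct pending_read {
	cont_fn	 k;		/* what to call when the I/O finishes */
	void	*cookie;	/* caller's context, passed back to k */
	size_t	 len;
	int	 busy;
};

/*
 * Start a read.  This never sleeps: the result is delivered by
 * calling k, either right away or later from read_done().
 */
static void
read_async(struct pending_read *pr, size_t len, int cached,
    cont_fn k, void *cookie)
{
	if (cached) {
		/*
		 * No wait needed.  Without tail-call elimination this
		 * call nests on top of the caller's stack frame.
		 */
		k(cookie, 0, len);
		return;
	}
	pr->k = k;
	pr->cookie = cookie;
	pr->len = len;
	pr->busy = 1;
}

/* Called when the slow I/O completes, much as biodone() is for a buf. */
static void
read_done(struct pending_read *pr, int error)
{
	assert(pr->busy);
	pr->busy = 0;
	pr->k(pr->cookie, error, error ? 0 : pr->len);
}

/* Example cookie and continuation: record what would go in the reply. */
struct reply_state {
	int	replied;
	int	error;
	size_t	nread;
};

static void
send_reply(void *cookie, int error, size_t nread)
{
	struct reply_state *rs = cookie;

	rs->replied = 1;
	rs->error = error;
	rs->nread = nread;
}
```

A caller would do something like read_async(&pr, 8192, 0, send_reply,
&rs) and go back to serving other requests; nothing is replied until
read_done(&pr, 0) runs.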

Bringing this style of coding up out of interrupt level into the
filesystem would be painful.. and I suspect everyone would hate you
for it.  Also, absent compiler hacks to do tail-call elimination, it
would probably significantly increase stack usage in the cases where
you don't actually have to wait.  Debugging would get more painful
since you'd have to manually trace back where a request came from
through multiple layers of continuation routines and cookies.

However, it would expose a lot more I/O parallelism, which the
performance guys would really like..

> Would having the nfs server execute all its reads & writes with IO_NDELAY
> (and teaching my stuff to return EWOULDBLOCK) make sense?

No, I think it's harder than that.  What does the NFS server do when
it gets an EWOULDBLOCK?  Do you put the request back in a queue to try
again later?  (ewww, polling).  It makes more sense to put the
in-progress request aside somewhere until you get some indication that
it completed.
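
A very rough sketch of "putting the request aside", again in userland
C with made-up names (park_request(), park_complete(), struct
park_table).  Keying the table on the RPC transaction id alone is an
assumption made to keep the example short; a real server would at
least have to include the client's address in the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARKED 16

/* One request set aside while its data is being fetched. */
struct parked {
	uint32_t xid;		/* RPC transaction id of the request */
	int	 in_use;
};

struct park_table {
	struct parked slot[MAX_PARKED];
};

/* Find the parked entry for xid, or NULL if there is none. */
static struct parked *
park_find(struct park_table *t, uint32_t xid)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < MAX_PARKED; i++)
		if (t->slot[i].in_use && t->slot[i].xid == xid)
			return &t->slot[i];
	return NULL;
}

/*
 * Set a request aside instead of sleeping on it.
 * Returns 0 if parked, 1 if it is a retransmission of a request that
 * is already parked (the caller can drop it rather than tie up
 * another worker), or -1 if the table is full.
 */
static int
park_request(struct park_table *t, uint32_t xid)
{
	size_t i;

	if (park_find(t, xid) != NULL)
		return 1;
	for (i = 0; i < MAX_PARKED; i++) {
		if (!t->slot[i].in_use) {
			t->slot[i].xid = xid;
			t->slot[i].in_use = 1;
			return 0;
		}
	}
	return -1;
}

/*
 * Completion notification: the data has arrived.  Returns 1 if a
 * parked request was released and should now be answered, 0 if
 * nothing was parked under that xid.
 */
static int
park_complete(struct park_table *t, uint32_t xid)
{
	struct parked *p = park_find(t, xid);

	if (p == NULL)
		return 0;
	p->in_use = 0;
	return 1;
}
```

Nothing here polls: a worker parks the request and moves on, and the
reply is only built when park_complete() is driven by whatever
signals that the restore finished.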

What about when VOP_LOOKUP or VOP_GETATTR blocks because directories
or the metadata were also migrated out to cold storage?

And how does an NFS client tell the difference between a server which
is off the air, and a server which is frantically hunting for the
right tape?  You might be able to do something using NQNFS leases here
so that clients which support NQNFS are in a little better shape...

				- Bill