Re: PROPOSAL: Split uiomove into uiopeek, uioskip

To: tech-kern%netbsd.org@localhost
Subject: Re: PROPOSAL: Split uiomove into uiopeek, uioskip
From: David Holland <dholland-tech%netbsd.org@localhost>
Date: Thu, 11 May 2023 01:29:49 +0000

On Wed, May 10, 2023 at 10:11:31AM +0000, Taylor R Campbell wrote:
 > > > (In general, erroring in I/O is a whole additional can of worms; it's
 > > > wrong to not report back however much work happened before the error
 > > > when it can't be undone, but also impossible to both report work and
 > > > raise an error.  [...])
 > > 
 > > To the extent that it's impossible, it's impossible only because the
 > > APIs provided by the kernel to userland don't have any way to represent
 > > such a thing.  This would border on trivial to fix, except that it
 > > would be difficult to get much userland code to use the resulting APIs
 > > because of their lack of portability to other UNIX variants.

Since write(2) is one of the oldest interfaces in Unix, the chances of
any change taking hold are vanishingly small...

 > write/writev/pwrite(2) could return the number of bytes it actually
 > did transfer before the fault, unless it is zero in which case it
 > could fail with EFAULT.

Yes; this is really the only correct option. The same goes for EIO,
and also for other conditions that can arise in the middle of an I/O,
like EDQUOT or, with NFS, ECONNRESET. Ultimately, if you actually do
some I/O, the count reflecting what you did needs to be returned.
Otherwise any application response other than "give up entirely" can
result in duplicating or losing data.
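
The application-side retry that this makes possible can be sketched as
follows. This is a minimal sketch, assuming a blocking descriptor;
`write_all` is a hypothetical helper name, not an existing API. It is
only correct if the kernel reports an accurate count whenever any bytes
were transferred.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write all of buf to fd, resuming after short counts.
 * Returns 0 on success, or -1 with errno set. *donep always receives
 * the number of bytes the kernel reported as written.
 */
static int
write_all(int fd, const void *buf, size_t len, size_t *donep)
{
	const char *p = buf;
	size_t done = 0;

	assert(donep != NULL);
	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* reported as no progress; retry */
			*donep = done;
			return -1;
		}
		done += (size_t)n;	/* short count: resume after it */
	}
	*donep = done;
	return 0;
}
```

If a write that did transfer data returns -1 instead of the count, this
loop has no way to tell how much to skip, so resuming either duplicates
or loses data.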

Since at least some errors aren't sticky (that is, they might not
repeat if the application restarts the rest of the I/O, and failing to
repeat doesn't automatically mean that the condition cleared and no
longer matters), you really also want to avoid losing them. As I
recall, the conclusion from past discussions is that the only really
suitable approach, messy though it is, is to stash the error in the
struct file (or somewhere) and then return it immediately on the next
operation. This applies at least to EIO and maybe others.

(For EFAULT, if it doesn't repeat it means the condition cleared, and
this is expected in the case of a generational G/C's write barriers
and similar usages. So just dropping it in the case of partial success
is ok.)
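
The stash-and-report-later rule, together with the EFAULT exception,
can be modeled in a few lines. This is a user-space toy under stated
assumptions: `struct toy_file`, `toy_finish_io`, and `toy_take_pending`
are hypothetical names, and nothing here is NetBSD's struct file or its
actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for per-open-file state; NOT NetBSD's struct file. */
struct toy_file {
	int pending_error;	/* deferred errno value, 0 if none */
};

/*
 * Turn the outcome of one transfer (bytes done, errno-style error)
 * into a write(2)-style result: a byte count, or -1 with *errp set.
 */
static long
toy_finish_io(struct toy_file *fp, size_t done, int error, int *errp)
{
	assert(fp != NULL && errp != NULL);
	*errp = 0;
	if (error != 0 && done > 0) {
		/* Partial success: report the count now. */
		if (error != EFAULT)
			fp->pending_error = error;	/* error comes later */
		return (long)done;
	}
	if (error != 0) {
		*errp = error;	/* no progress: fail outright */
		return -1;
	}
	return (long)done;
}

/* Call at the start of each operation to deliver a stashed error. */
static int
toy_take_pending(struct toy_file *fp)
{
	int error = fp->pending_error;

	fp->pending_error = 0;
	return error;
}
```

In this model a partial write that hit EIO returns its count, and the
next operation on the same file fails with EIO before doing anything.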

Note, though, that returning short counts without an error can also
violate expectations about short counts, interruption, and fast vs.
slow devices. None of that is very well defined, though, so it is
almost certainly a better compromise than returning an error after
successfully doing I/O.

The question that comes to my mind is whether it's better as an
interface for writes (and reads) to return errors along with a
uio_resid change, or whether we should distinguish failure of the
operation (stuff like ENXIO/ENODEV or possibly ESTALE) from failure of
the I/O, and return the latter separately in struct uio.

The latter seems kind of preferable to me (because it maintains the
expectation that if you get a failure return nothing happened) but
it's a big architectural change.
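
To make the second alternative concrete, here is a rough sketch of what
separating the two kinds of failure could look like. Everything in it
is an assumption for illustration: `struct toy_xfer`, its `io_error`
field, and `toy_dev_write` are hypothetical, and the `present` and
`fail_after` parameters only simulate a device.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy transfer descriptor; the io_error field is hypothetical. */
struct toy_xfer {
	size_t resid;		/* bytes still to transfer */
	int io_error;		/* error hit partway through, 0 if none */
};

/*
 * Hypothetical device write. The return value is nonzero only when the
 * operation could not start at all; a failure in the middle of the
 * transfer is recorded in xfer->io_error next to the updated resid.
 */
static int
toy_dev_write(struct toy_xfer *xfer, int present, size_t fail_after)
{
	assert(xfer != NULL);
	if (!present)
		return ENXIO;	/* nothing happened; xfer untouched */
	if (xfer->resid > fail_after) {
		xfer->resid -= fail_after;	/* moved what it could */
		xfer->io_error = EIO;		/* then the I/O failed */
		return 0;
	}
	xfer->resid = 0;	/* whole transfer succeeded */
	return 0;
}
```

A caller of such an interface could rely on a nonzero return meaning
that resid is unchanged, and would check io_error separately to learn
about a transfer that stopped early.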

All of this is not _independent_ of fixing uiomove callers, because
they're where EIO gets generated and some of them have some
flexibility, and so we need a model for the error handling to update
them correctly, but it's largely orthogonal to the original problem of
incorrectly rolling back partial uiomoves. :-(

(Examples of flexibility: if you copy 32 bytes out of a uio into a
scratch buffer and then get EFAULT, you haven't actually done anything
yet, so you can leave uio_resid alone and let the enclosing write call
ultimately fail with EFAULT. But if you copied the same 32 bytes over
a buffer-cache buffer, for example, you can't undo that so you have to
report it. Conversely, if you read 32 bytes out of a tty you can't put
it back and have to report it, but if you read 32 bytes out of a
buffer cache buffer it'll still be there later and you haven't
actually done anything yet.)
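
The scratch-buffer case can be sketched with a peek/skip pair. This is
a user-space toy loosely following the names in the subject line, not
the proposed NetBSD functions or their signatures: `struct toy_src`,
`toy_peek`, `toy_skip`, and `toy_read_header` are hypothetical, and
`bad_from` only simulates a fault at some user address.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy byte source; bad_from simulates an unmapped user address. */
struct toy_src {
	const char *base;
	size_t offset;		/* bytes consumed so far */
	size_t resid;		/* bytes remaining */
	size_t bad_from;	/* copies reaching past this offset fault */
};

/* Copy n bytes out without consuming them (offset, resid unchanged). */
static int
toy_peek(const struct toy_src *src, void *dst, size_t n)
{
	assert(n <= src->resid);
	if (src->offset + n > src->bad_from)
		return EFAULT;
	memcpy(dst, src->base + src->offset, n);
	return 0;
}

/* Consume n bytes that an earlier peek already copied. */
static void
toy_skip(struct toy_src *src, size_t n)
{
	assert(n <= src->resid);
	src->offset += n;
	src->resid -= n;
}

/* Read a 32-byte header via a scratch buffer; commit only on success. */
static int
toy_read_header(struct toy_src *src, char hdr[32])
{
	char scratch[32];
	int error;

	if (src->resid < sizeof(scratch))
		return EINVAL;
	error = toy_peek(src, scratch, sizeof(scratch));
	if (error != 0)
		return error;	/* nothing consumed, nothing to undo */
	memcpy(hdr, scratch, sizeof(scratch));
	toy_skip(src, sizeof(scratch));
	return 0;
}
```

Because the fault is detected before anything is consumed, resid is
left alone and the enclosing call can simply fail with EFAULT. A copy
straight into a buffer-cache buffer would not have that option.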

 > Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not
 > sure it's right to fail after partial progress with _any_ error code,
 > except possibly EIO if metadata updates failed.

It's not right. (And I don't think EIO is an exception.)

 > I don't see clear guidance in POSIX on the subject.

shocking :-)

 > The attached test program successfully writes 65536 bytes to a pipe in
 > a single writev(2) call that fails and reports no progress, in a
 > NetBSD 10ish kernel:

That's definitely a bug...

-- 
David A. Holland
dholland%netbsd.org@localhost
