Re: partial failures in write(2) (and read(2))

To: Rhialto <rhialto%falu.nl@localhost>
Subject: Re: partial failures in write(2) (and read(2))
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Tue, 16 Feb 2021 17:29:00 +0700

    Date:        Mon, 15 Feb 2021 23:18:33 +0100
    From:        Rhialto <rhialto%falu.nl@localhost>
    Message-ID:  <YCrzOZ0YIiY9q96F%falu.nl@localhost>

  | A system call with error can return with the carry set and the error and
  | short count returned in a separate registers. The carry bit is how
  | errors used to be indicated since at least V7 (even V6?) anyway.

Earlier than V6; this dates back to when much of the system was
written in assembly code (including many of the utilities).

The issue isn't how to return multiple values from the kernel; that's
easy. We even have standard system calls (like pipe()) which do that
routinely.

The problem is that write() (and most other system calls) is defined
not to affect errno unless there is an error, and if there is an
error, to return -1 (which leaves no place to return a short count
as well). The conversion to -1 and errno actually happens in the
libc stub.
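
To make the constraint concrete, here is a rough model of what such a
stub does with the kernel's result. The struct raw_result type and the
stub_return() name are made up for illustration; a real stub is a few
lines of machine-dependent assembly, not C.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical raw kernel result: a carry flag plus one value register. */
struct raw_result {
    int  carry;   /* nonzero means the call failed */
    long value;   /* byte count on success, error number on failure */
};

/* Model of how a write() stub turns the raw result into the C return value. */
ssize_t stub_return(struct raw_result r)
{
    assert(r.value >= 0);          /* counts and error numbers are non-negative */
    if (r.carry) {
        errno = (int)r.value;      /* the error number goes into errno */
        return -1;                 /* no room left to report a partial count */
    }
    return (ssize_t)r.value;       /* success: errno is left untouched */
}
```

Even if the kernel handed back a count in a second register, the single
ssize_t return value has nowhere to put it once -1 is taken.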

We could, of course, invent new interfaces (perhaps a write variant
with an extra pointer argument that receives the length written, or
one where the length argument is a pointer to a size_t which is read
and then overwritten with either the amount written or the amount
not written).
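
For the sake of argument, the first variant might look something like
the sketch below. The name write_n() and its exact semantics are
invented here; no such call exists. This version is only a user-space
emulation on top of write(2), so it can report what earlier write()
calls accepted, not what the kernel knew at the point of failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical interface: returns 0 on success or -1 with errno set,
 * and in both cases stores the number of bytes accepted in *written.
 */
int write_n(int fd, const void *buf, size_t len, size_t *written)
{
    const char *p = buf;
    size_t done = 0;

    assert(written != NULL);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted before any transfer: retry */
            *written = done;       /* error, but the partial count survives */
            return -1;
        }
        if (n == 0)
            break;                 /* no progress: stop rather than spin */
        done += (size_t)n;
    }
    *written = done;
    return 0;
}
```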

But I don't believe that any of this is needed, or desirable.

We should first make sure that we do what POSIX requires, and simply
return a short write count (and no error) in the cases where that
should happen (out of space, over quota, exceeding the file size
limit, or when writing any more would block and O_NONBLOCK is set;
are there more?).
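
From the caller's side, the existing interface already distinguishes
these cases, as the sketch below tries to show. The names wr_outcome,
set_nonblock() and write_once() are made up for this example. With
O_NONBLOCK set on a pipe, a write larger than the free space would be
expected to come back short, and the next attempt to fail with EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

enum wr_outcome { WR_FULL, WR_SHORT, WR_ERROR };

/* Turn on O_NONBLOCK for fd; returns 0 on success, -1 on failure. */
int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* One write() attempt, classified; *done gets the byte count (0 on error). */
enum wr_outcome write_once(int fd, const void *buf, size_t len, size_t *done)
{
    ssize_t n;

    assert(done != NULL);
    do {
        n = write(fd, buf, len);
    } while (n == -1 && errno == EINTR);   /* retry only plain interruptions */

    if (n == -1) {
        *done = 0;
        return WR_ERROR;                   /* errno says why */
    }
    *done = (size_t)n;
    return (size_t)n == len ? WR_FULL : WR_SHORT;
}
```

A caller that sees WR_SHORT can simply try again with the remainder;
if the condition persists, that second call reports the error.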

In the other error cases we should simply leave things alone and
accept it - it is the way unix always has been, and we have survived.
If we have a drive returning I/O errors (on writes), do we really
expect that earlier data written will have been written correctly?
Do you want to rely upon that?    It might have been possible once,
when drives were stupid, and simply wrote sectors in the order
presented, but with modern drives, with internal caches, which
write the data in any order they like, when they like, and do block
remapping when a sector goes bad, I wouldn't trust anything on
the drive once it starts saying a write failed.   Pretending that
the first 8KB of a 16KB write worked, and that there was an I/O error
after that, is folly.   It may easily have been that the second 8KB
block was written, and the first one gave up in error, eventually.
Some of the data intended to be written may have been written, but
we have no sane way to work out what (again, entire new interfaces
could allow the info to be returned, but to what point?  Who would
ever write code to make use of that info?)

It's even worse for the remaining cases, where the error is caused
by broken software (either a broken kernel doing insane things, or
a broken application asking to write data from memory it does not
own, etc).   Nothing can be assumed reliable in cases like that.

So, let's all forget fanciful interface redesigns, fix whatever we
need to fix to make things work the way they are supposed to work
(if there is anything) and leave the rest as "the world just broke"
type territory.

kre

Follow-Ups:
- Re: partial failures in write(2) (and read(2))
  - From: David Holland

References:
- Re: partial failures in write(2) (and read(2))
  - From: Rhialto
- partial failures in write(2) (and read(2))
  - From: David Holland
- Re: partial failures in write(2) (and read(2))
  - From: Mouse
- Re: partial failures in write(2) (and read(2))
  - From: Thor Lancelot Simon
- Re: partial failures in write(2) (and read(2))
  - From: John Franklin
