tech-kern archive


Re: PROPOSAL: Split uiomove into uiopeek, uioskip



> Date: Tue, 9 May 2023 23:03:27 -0400 (EDT)
> From: Mouse <mouse%Rodents-Montreal.ORG@localhost>
> 
> > (In general, erroring in I/O is a whole additional can of worms; it's
> > wrong to not report back however much work happened before the error
> > when it can't be undone, but also impossible to both report work and
> > raise an error.  [...])
> 
> To the extent that it's impossible, it's impossible only because the
> APIs provided by the kernel to userland don't have any way to represent
> such a thing.  This would border on trivial to fix, except that it
> would be difficult to get much userland code to use the resulting APIs
> because of their lack of portability to other UNIX variants.

write/writev/pwrite(2) could return the number of bytes it actually
did transfer before the fault, unless that count is zero, in which
case it could fail with EFAULT.

It already behaves this way for EINTR and EAGAIN/EWOULDBLOCK -- but
not for EFAULT:

https://nxr.netbsd.org/xref/src/sys/kern/sys_generic.c?r=1.134#478

Strikes me as a bug that it doesn't do this for EFAULT -- and I'm not
sure it's right to fail after partial progress with _any_ error code,
except possibly EIO if metadata updates failed.
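
A minimal sketch of that rule, for illustration only: xfer_result is a
hypothetical helper name, not anything in sys_generic.c, and it ignores
the possible EIO exception mentioned above.

```c
#include <sys/types.h>

#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: turn "bytes transferred so far" plus an error
 * code into a write(2)-style return value.  Partial progress wins over
 * the error; the error is reported only when nothing was transferred.
 */
ssize_t
xfer_result(size_t done, int error)
{
	if (error == 0 || done > 0)
		return (ssize_t)done;
	errno = error;
	return -1;
}
```

With this rule, xfer_result(10, EFAULT) reports 10 bytes and leaves
errno alone, while xfer_result(0, EFAULT) returns -1 with errno set to
EFAULT.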

I don't see clear guidance in POSIX on the subject.

- The EINTR and EAGAIN/EWOULDBLOCK behaviour is mandated:

  `If write() is interrupted by a signal before it writes any data, it
   shall return -1 with errno set to [EINTR].'

  `Write requests to a pipe or FIFO ... If the O_NONBLOCK flag is
   set.... A write request for more than {PIPE_BUF} bytes ... When at
   least one byte can be written, transfer what it can and return the
   number of bytes written.'

- Without O_NONBLOCK, the DESCRIPTION section says `on normal
  completion it shall return nbyte', but a fault is not normal
  completion.

- The RATIONALE section says `Partial and deferred writes are only
  possible with O_NONBLOCK set' and `There is no exception regarding
  partial writes when O_NONBLOCK is set', but that's inconsistent with
  the behaviour mandated for interruption by a signal.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html

The attached test program successfully writes 65536 bytes to a pipe in
a single writev(2) call, even though that call fails with EFAULT
(errno 14) and reports no progress, on a NetBSD 10ish kernel:

$ ./writefault | cat >foo
space=65536
nwrit=-1 errno=14
$ stat -f %z foo
65536

A similar program on macOS 13.3 (using a hard-coded space instead of
ioctl(FIONSPACE), which seems to be missing from macOS) exhibits the
same behaviour.  But a similar program on Linux 4.15 (using
fpathconf(_PC_PIPE_BUF)) returns the partial number of bytes written
(and with anything less than fpathconf(_PC_PIPE_BUF) it fails without
writing anything to foo).

_If_ write/writev/pwrite(2) were modified to handle EFAULT like it
currently handles EINTR and EAGAIN/EWOULDBLOCK and return partial
progress, then the current uiomove(9) semantics would give the wrong
indication of progress in the event of fault -- too much progress,
rather than too little.  In contrast, uiopeek/uioskip would allow
pipe_write to report exactly the number of bytes that it actually made
available to the reader.
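
The difference can be shown with a userland toy model.  Everything
below is hypothetical (toy_uio, toy_peek, toy_skip, toy_move and the
two writers are made-up names, not the kernel interfaces), and toy_move
advancing the cursor by whatever it copied before the fault is an
assumption standing in for uiomove(9) having consumed part of the
request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct uio; every name here is hypothetical. */
struct toy_uio {
	const char *src;	/* user data */
	size_t off;		/* cursor: bytes consumed so far */
	size_t fault_at;	/* offsets >= fault_at are unreadable */
};

/* Toy pipe: bytes in buf[0..len) are visible to the reader. */
struct toy_pipe {
	char buf[64];
	size_t len;
};

/* Copy n bytes at the cursor without moving it; EFAULT on a bad byte. */
int
toy_peek(void *dst, size_t n, const struct toy_uio *u)
{
	size_t ok = u->fault_at > u->off ? u->fault_at - u->off : 0;

	if (ok > n)
		ok = n;
	memcpy(dst, u->src + u->off, ok);
	return ok == n ? 0 : EFAULT;
}

/* Advance the cursor by n bytes. */
void
toy_skip(size_t n, struct toy_uio *u)
{
	u->off += n;
}

/* Copy and advance together, even when the copy stops early. */
int
toy_move(void *dst, size_t n, struct toy_uio *u)
{
	size_t ok = u->fault_at > u->off ? u->fault_at - u->off : 0;

	if (ok > n)
		ok = n;
	memcpy(dst, u->src + u->off, ok);
	u->off += ok;
	return ok == n ? 0 : EFAULT;
}

/* Writer built on toy_move: a chunk is published only on success. */
int
pipe_write_move(struct toy_pipe *p, struct toy_uio *u, size_t n)
{
	int error = toy_move(p->buf + p->len, n, u);

	if (error)
		return error;	/* cursor already past unpublished bytes */
	p->len += n;
	return 0;
}

/* Writer built on peek/skip: the cursor moves only after publishing. */
int
pipe_write_peek(struct toy_pipe *p, struct toy_uio *u, size_t n)
{
	int error = toy_peek(p->buf + p->len, n, u);

	if (error)
		return error;	/* cursor untouched */
	p->len += n;
	toy_skip(n, u);
	return 0;
}
```

Writing two 4-byte chunks from an 8-byte source that faults at offset 6
leaves the toy_move writer with a cursor at 6 but only 4 bytes
published, whereas the peek/skip writer ends with cursor and published
length both at 4.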
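
/*
 * Test program: writev(2) to the pipe on stdout, with the second iovec
 * pointing into an inaccessible page so that the copy faults after the
 * first iovec.
 */
#include <sys/ioctl.h>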
#include <sys/mman.h>
#include <sys/syslimits.h>
#include <sys/uio.h>

#include <assert.h>
#include <err.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define	HOWMANY(X, N)	(((X) + ((N) - 1))/(N))

int
main(void)
{
	const int PAGE_SIZE = sysconf(_SC_PAGESIZE);
	int space;
	struct iovec iov[2];
	size_t allocsize;
	void *pg;
	ssize_t nwrit;

	/* Ask how much free space the pipe on stdout currently has. */
	if (ioctl(STDOUT_FILENO, FIONSPACE, &space) == -1)
		err(1, "space");
	space *= 4;		/* BIG_PIPE_SIZE */
	fprintf(stderr, "space=%d\n", space);

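	/* Map enough pages for the payload plus one trailing guard page. */
	allocsize = (HOWMANY(space, PAGE_SIZE) + 1) * PAGE_SIZE;
	pg = mmap(NULL, allocsize, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE,
	    -1, 0);
	if (pg == MAP_FAILED)
		err(1, "mmap");
	memset(pg, 0x53, allocsize);
	/* Revoke all access to the last page so that reading it faults. */
	if (mprotect((char *)pg + allocsize - PAGE_SIZE, PAGE_SIZE,
	    PROT_NONE) == -1)
		err(1, "mprotect");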

	/* iov[0] is readable data; iov[1] is one byte in the guard page. */
	iov[0].iov_base = pg;
	iov[0].iov_len = space;
	iov[1].iov_base = (char *)pg + allocsize - PAGE_SIZE;
	iov[1].iov_len = 1;
	errno = 0;
	nwrit = writev(STDOUT_FILENO, iov, __arraycount(iov));
	fprintf(stderr, "nwrit=%zd errno=%d\n", nwrit, errno);

	return 0;
}

