tech-kern archive

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index][Old Index]

PROPOSAL: Split uiomove into uiopeek, uioskip



tl;dr

I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which
together are equivalent to successful uiomove(buf, n, uio).

This allows a driver to separately:
1. transfer data into a buffer (uiopeek) first, and then
2. advance the uio (uioskip) only after determining how much of the
   data the driver was able to actually process.

Thoughts?


DETAILS

A common pattern when transferring data between a driver and userland
is unlocked uiomove(9) into a temporary buffer, and then some I/O
operation under a driver lock.  The uiomove(9) itself normally can't
be done under a driver lock because it may fault and sleep
indefinitely for I/O.

However, sometimes the I/O operation doesn't consume everything.  This
can happen in ttwrite (sys/kern/tty.c): in the non-blocking case
(flags & IO_NDELAY), it may bite off a chunk of the uio bigger than
the tty can chew without blocking.

In that case, ttwrite hacks uio->uio_resid so that the write(2) system
call reports the correct number of bytes writen so the user program
can pick up where it left off (a hack that has existed since 1990 or
earlier) -- but the uio itself becomes inconsistent, because it
doesn't roll back uio->uio_iov or uio->uio_iov[0].iov_base or anything
else.

And ktrwrite (sys/kern/kern_ktrace.c) tries to reuse the uio in the
event that fp->f_ops->fo_write returns EWOULDBLOCK, after pausing for
a moment.  This requires the uio to be consistent, which falls apart
with ttwrite, as in a litany of syzbot reports like this one:

https://syzkaller.appspot.com/bug?id=976210ed438d48ac275d77d7ebf4a086e43b5fcb

Now, ktracing to a tty is pretty weird, but the salient point is:

(a) Is ttwrite violating a contract by advancing the uio's iovs past
    the number of bytes it consumed of uio_resid?

or

(b) Is ktrwrite abusing struct fileops::fo_write by reusing uio even
    after an error?

If (a), the attached patch breaks ttwrite's uiomove into two parts:
uiopeek, to get the data into a temporary buffer, which may be then
fed byte-by-byte into ttyoutput; and uioskip, to advance the uio by as
many bytes as were actually consumed by the tty -- all of them in the
success case, but not all if we stop with EWOULDBLOCK.

If (b), we have to convince ktrwrite to reconstruct a uio that picks
up where fo_write left off in the same buffer.  Maybe we could
reconstruct it but skip the first n bytes, or take a snapshot of the
uio, but either way it would be annoying to deal with -- more fiddly
than writing, using, and testing uiopeek/uioskip which I already did.

The problem appears not to manifest in OpenBSD or FreeBSD because they
have nothing that calls fo_write in a loop with the same uio -- their
ktrace is limited to vnodes and doesn't handle EWOULDBLOCK by looping.
But their ttwrite routines (or equivalent) do leave the uio
inconsistent on error, with the same comment Mike Karels left back in
1990 in the CSRG tty.c 7.21.[*]

P.S.  I suspect this issue will also apply in the EFAULT case.  If you
      try to uiomove (say) 2 bytes in separate iov entries, and the
      first iov doesn't fault but the second iov does, uiomove will
      advance the uio by one byte but return error.

      The caller probably won't process the first byte even though
      uio_resid has been decremented.  So it will appear as though one
      byte has been transferred even though that byte was actually
      just copied to a buffer and then discarded.

      That might be a more common issue throughout the tree which
      could also be addressed by separating it into uiopeek and
      uioskip, but I haven't put further thought to whether it's a
      real issue or further effort into auditing drivers for it.

[*] https://github.com/robohack/ucb-csrg-bsd/commit/b6660865c9f41cce98f8728f7f27b34b4a231e81#diff-37ae898972ce315c35bb81ff0e209efb141a9967efef821c7e31fda8d67d7ff2R1479-R1486
diff --git a/share/man/man9/uiomove.9 b/share/man/man9/uiomove.9
index 6417ea14b684..d2f579253386 100644
--- a/share/man/man9/uiomove.9
+++ b/share/man/man9/uiomove.9
@@ -24,7 +24,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd April 26, 2010
+.Dd May 9, 2023
 .Dt UIOMOVE 9
 .Os
 .Sh NAME
@@ -34,6 +34,10 @@
 .In sys/systm.h
 .Ft int
 .Fn uiomove "void *buf" "size_t n" "struct uio *uio"
+.Ft int
+.Fn uiopeek "void *buf" "size_t n" "struct uio *uio"
+.Ft void
+.Fn uioskip "void *buf" "size_t n" "struct uio *uio"
 .Sh DESCRIPTION
 The
 .Fn uiomove
@@ -140,10 +144,35 @@ to point that much farther into the region described.
 This allows multiple calls to
 .Fn uiomove
 to easily be used to fill or drain the region of data.
+.Pp
+The
+.Fn uiopeek
+function copies up to
+.Fa n
+bytes of data without updating
+.Fa uio ;
+the
+.Fn uioskip
+function updates
+.Fa uio
+without copying any data, and is guaranteed never to sleep or fault
+even if the buffers are in userspace and memory access via
+.Fn uiomove
+or
+.Fn uiopeek
+would trigger paging.
+A successful
+.Fn uiomove buf n uio
+call is equivalent to a successful
+.Fn uiopeek buf n uio
+followed by
+.Fn uioskip n uio .
 .Sh RETURN VALUES
 Upon successful completion,
 .Fn uiomove
-returns 0.
+and
+.Fn uiopeek
+return 0.
 If a bad address is encountered,
 .Er EFAULT
 is returned.
diff --git a/sys/kern/subr_copy.c b/sys/kern/subr_copy.c
index cd934032d33c..02834432d31d 100644
--- a/sys/kern/subr_copy.c
+++ b/sys/kern/subr_copy.c
@@ -166,6 +166,93 @@ uiomove_frombuf(void *buf, size_t buflen, struct uio *uio)
 	return (uiomove((char *)buf + offset, buflen - offset, uio));
 }
 
+int
+uiopeek(void *buf, size_t n, struct uio *uio)
+{
+	struct vmspace *vm = uio->uio_vmspace;
+	struct iovec *iov;
+	size_t cnt;
+	int error = 0;
+	char *cp = buf;
+	size_t resid = uio->uio_resid;
+	int iovcnt = uio->uio_iovcnt;
+	char *base;
+	size_t len;
+
+	KASSERT(uio->uio_rw == UIO_READ || uio->uio_rw == UIO_WRITE);
+
+	if (n == 0 || resid == 0)
+		return 0;
+	iov = uio->uio_iov;
+	base = iov->iov_base;
+	len = iov->iov_len;
+
+	while (n > 0 && resid > 0) {
+		KASSERT(iovcnt > 0);
+		cnt = len;
+		if (cnt == 0) {
+			KASSERT(iovcnt > 1);
+			iov++;
+			iovcnt--;
+			base = iov->iov_base;
+			len = iov->iov_len;
+			continue;
+		}
+		if (cnt > n)
+			cnt = n;
+		if (!VMSPACE_IS_KERNEL_P(vm)) {
+			preempt_point();
+		}
+
+		if (uio->uio_rw == UIO_READ) {
+			error = copyout_vmspace(vm, cp, base, cnt);
+		} else {
+			error = copyin_vmspace(vm, base, cp, cnt);
+		}
+		if (error) {
+			break;
+		}
+		base += cnt;
+		len -= cnt;
+		resid -= cnt;
+		cp += cnt;
+		KDASSERT(cnt <= n);
+		n -= cnt;
+	}
+
+	return error;
+}
+
+void
+uioskip(size_t n, struct uio *uio)
+{
+	struct iovec *iov;
+	size_t cnt;
+
+	KASSERTMSG(n <= uio->uio_resid, "n=%zu resid=%zu", n, uio->uio_resid);
+
+	KASSERT(uio->uio_rw == UIO_READ || uio->uio_rw == UIO_WRITE);
+	while (n > 0 && uio->uio_resid) {
+		KASSERT(uio->uio_iovcnt > 0);
+		iov = uio->uio_iov;
+		cnt = iov->iov_len;
+		if (cnt == 0) {
+			KASSERT(uio->uio_iovcnt > 1);
+			uio->uio_iov++;
+			uio->uio_iovcnt--;
+			continue;
+		}
+		if (cnt > n)
+			cnt = n;
+		iov->iov_base = (char *)iov->iov_base + cnt;
+		iov->iov_len -= cnt;
+		uio->uio_resid -= cnt;
+		uio->uio_offset += cnt;
+		KDASSERT(cnt <= n);
+		n -= cnt;
+	}
+}
+
 /*
  * Give next character to user as result of read.
  */
diff --git a/sys/kern/tty.c b/sys/kern/tty.c
index c92de98ce3d7..4da3e496bfd1 100644
--- a/sys/kern/tty.c
+++ b/sys/kern/tty.c
@@ -2229,13 +2229,13 @@ ttwrite(struct tty *tp, struct uio *uio, int flag)
 {
 	u_char		*cp;
 	struct proc	*p;
-	int		cc, ce, i, hiwat, error;
+	int		cc, cc0, ce, i, hiwat, error;
 	u_char		obuf[OBUFSIZ];
 
 	cp = NULL;
 	hiwat = tp->t_hiwat;
 	error = 0;
-	cc = 0;
+	cc0 = cc = 0;
  loop:
 	mutex_spin_enter(&tty_lock);
 	if (!CONNECTED(tp)) {
@@ -2300,9 +2300,10 @@ ttwrite(struct tty *tp, struct uio *uio, int flag)
 		 * leftover from last time.
 		 */
 		if (cc == 0) {
-			cc = uimin(uio->uio_resid, OBUFSIZ);
+			uioskip(cc0, uio);
+			cc0 = cc = uimin(uio->uio_resid, OBUFSIZ);
 			cp = obuf;
-			error = uiomove(cp, cc, uio);
+			error = uiopeek(cp, cc, uio);
 			if (error) {
 				cc = 0;
 				goto out;
@@ -2373,13 +2374,9 @@ ttwrite(struct tty *tp, struct uio *uio, int flag)
 	}
 
  out:
-	/*
-	 * If cc is nonzero, we leave the uio structure inconsistent, as the
-	 * offset and iov pointers have moved forward, but it doesn't matter
-	 * (the call will either return short or restart with a new uio).
-	 */
 	KASSERTMSG(error || cc == 0, "error=%d cc=%d", error, cc);
-	uio->uio_resid += cc;
+	KASSERTMSG(cc0 >= cc, "cc0=%d cc=%d", cc0, cc);
+	uioskip(cc0 - cc, uio);
 	return (error);
 
  overfull:
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index 3585a0e9fbea..b59536f6c079 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -620,6 +620,8 @@ void	trace_exit(register_t, const struct sysent *, const void *,
 
 int	uiomove(void *, size_t, struct uio *);
 int	uiomove_frombuf(void *, size_t, struct uio *);
+int	uiopeek(void *, size_t, struct uio *);
+void	uioskip(size_t, struct uio *);
 
 #ifdef _KERNEL
 int	setjmp(label_t *) __returns_twice;


Home | Main Index | Thread Index | Old Index