kern/53152: fsync error chain lossage

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/53152: fsync error chain lossage
From: dholland%eecs.harvard.edu@localhost
Date: Tue, 3 Apr 2018 21:35:00 +0000 (UTC)

>Number:         53152
>Category:       kern
>Synopsis:       fsync error chain lossage
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Apr 03 21:35:00 +0000 2018
>Originator:     dholland%NetBSD.org@localhost
>Release:        NetBSD 8.99.14 (20180402)
>Organization:
>Environment:
System: NetBSD macaran 8.99.14 NetBSD 8.99.14 (MACARAN) #48: Mon Apr 2 18:37:50 EDT 2018 dholland@macaran:/usr/src/sys/arch/amd64/compile/MACARAN amd64
Architecture: x86_64
Machine: amd64
>Description:

Pursuant to the recent discussion about broken fsync behavior in Linux
(see 
https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com#CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ%mail.gmail.com@localhost
)

I've found four problems so far:

(1) I/O errors are not reported back to fsync at all.
(2) Write errors during genfs_putpages that fail for any reason other
    than ENOMEM cause the data to be semi-silently discarded.
(3) It appears that UVM pages are marked clean when they're selected to be
    written out, not after the write succeeds; so there are a bunch of
    potential races when writes fail.
(4) It appears that write errors for buffercache buffers are semi-silently
    discarded as well.

Further details:


(1) fsync implementations (e.g. ffs_full_fsync) use vflushbuf() to do
the syncing. Unfortunately, vflushbuf() discards the error return from
VOP_PUTPAGES. So even if the error reporting chain from biodone
through putpages works (I think it does, but there are various
opportunities for things to go wrong) it gets lost at this point.

Furthermore, for non-uvm pages, the loop immediately below in
vflushbuf() checks for error only sometimes: it uses bawrite() for
blocks that belong to the vnode and waits for them to complete via
v_numoutput, which doesn't involve an error check.


(2) The chain of bufferio completion calls eventually leads to
uvm_aio_aiodone_pages(), which contains this comment:
  /*
   * process errors.  for reads, just mark the page to be freed.
   * for writes, if the error was ENOMEM, we assume this was
   * a transient failure so we mark the page dirty so that
   * we'll try to write it again later.  for all other write
   * errors, we assume the error is permanent, thus the data
   * in the page is lost.  bummer.
   */
and the code returns with the page marked clean so the fact it wasn't
written is forgotten. The error is discarded at this point; but note
that it's also extracted from the buffer by biowait() and returned.
So there is one chance to see the error; unfortunately as already
noted it's lost.


(3) There are only a small number of places where PG_CLEAN is applied;
the one that occurs on the write path is in uvn_findpage() in
uvm_vnode.c, when called with the UFP_DIRTYONLY flag.
genfs_do_putpages does this to figure out what pages to write.
Moreover, the code in uvm_aio_aiodone_pages() that retries on ENOMEM
matches this in that it explicitly removes PG_CLEAN to cause a retry.
Ideally, the page will be locked/busy/something during the I/O so the
PG_CLEAN bit cannot be observed, but it's far from clear a priori
whether that's actually true. It would be bad if e.g. another fsync of
the same file running concurrently could observe the clean flag and 
succeed incorrectly believing the page had already been written.


(4) For buffercache buffers that are written with B_ASYNC (which
includes any that go through bawrite, such as many of the ones
explicitly written by vflushbuf as noted above) biodone2() goes
through the second of its three dispatch cases, which calls brelse()
without checking for error. brelse() at that point will invalidate and
discard it.


The discards are only semi-silent because hard I/O errors print kernel
messages at the device level; but that's not really an adequate
substitute for application error reporting.

>How-To-Repeat:

Code analysis.

>Fix:

Looks unpleasant.


>Unformatted:
 David A. Holland

Prev by Date: bin/53151: mdnsd tires to send on inactive (BACKUP state) CARP IP addresses
Next by Date: PR/48395 CVS commit: src/crypto/external/bsd/netpgp/dist/src
Previous by Thread: bin/53151: mdnsd tires to send on inactive (BACKUP state) CARP IP addresses
Next by Thread: Re: kern/53152: fsync error chain lossage
Indexes:

Home | Main Index | Thread Index | Old Index