tech-kern archive


fsync error reporting

I have been going through various code paths (hence all the posts and
commits about obscure details) with an eye toward documenting what we
do and do not guarantee about files to applications... and I've gotten
to fsync.

In an ideal world, if you write some data and later when it gets
written to disk the disk says EIO, this will be reported to the
application as an EIO return when (if) the application calls fsync.
This allows, for example, databases to detect when their transaction
commits are not actually committed, and take appropriate steps like
failing over to another replica.
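
To make the application side of this concrete, here is a minimal sketch
of a commit routine that relies on the ideal behavior described above.
The name commit_record is hypothetical, and the sketch assumes only the
standard write and fsync calls; it shows what an application would like
to be able to depend on, not what the kernel currently delivers.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write len bytes from buf to fd, then force them to stable storage.
 * Returns 0 on success or an errno value on failure. In the ideal
 * model, a disk-level write failure shows up as EIO from fsync.
 */
static int
commit_record(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted; try again */
			return errno;
		}
		p += n;
		len -= (size_t)n;
	}
	if (fsync(fd) == -1)
		return errno;	/* EIO: the commit did not reach the disk */
	return 0;
}
```

A database would treat any nonzero return as "not committed" and take
whatever recovery step it has, such as failing over.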

In the world we currently have, the chances of this working are
identically zero; the error reporting chain for fsync and for write
errors in general is completely borked.

There are several places where errors are discarded (e.g. any errors
reported by putpages are flatly ignored by vflushbuf) and other places
where errors are collected or not depending on circumstances; e.g.
again in vflushbuf I/O errors on metadata blocks are noticed only for
buffers attached to the disk vnode and not the file vnode. (And while
the comment says this identifies indirect blocks, I don't think that's
actually true in general either.)

Some of the current code paths that do check for error also give up as
soon as an error happens and skip trying to fsync the rest of the
file. This is probably not helpful; it's not necessarily clear that
it's completely wrong (since as kre points out, today once a disk
starts reporting EIO further behavior is probably undefined) but
that depends on circumstances. It's not entirely implausible that a
RAID might emit some transient EIOs during failover and then become
fully operable again a moment or two later. Also, iSCSI.

Also, some of the code paths skip over busy buffers, and not all of
them retry, meaning that blocks can be skipped, which is a fairly
serious bug if it happens.

Meanwhile, as appeared in the other thread about write errors, it
would be nice if separate processes writing to the same file either
didn't receive notification about each others' I/O errors, or if they
all received notifications about all I/O errors. (That is, it would be
useful to avoid the world where my I/O error gets reported to you and
then not to me.)

And currently there's a problem that the only way to flush the
underlying hardware-level caches is to call fsync_range and pass
FDISKSYNC. This might be POSIX (is it? man page doesn't say so) but it
doesn't necessarily seem helpful or the right way to go about handling
this issue.
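
For reference, the call in question looks roughly like the sketch
below. The wrapper name flush_through_disk_cache is hypothetical. The
preprocessor guard is there only so the example still compiles on
systems without FDISKSYNC, where it falls back to plain fsync and makes
no claim about what happens to the drive's cache.

```c
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>

/*
 * Flush the whole file, and where FDISKSYNC is available also ask the
 * underlying device to flush its own write cache.
 * Returns 0 on success or an errno value.
 */
static int
flush_through_disk_cache(int fd)
{
#if defined(FFILESYNC) && defined(FDISKSYNC)
	/* A length of 0 asks for the whole file to be synchronized. */
	if (fsync_range(fd, FFILESYNC | FDISKSYNC, 0, 0) == -1)
		return errno;
#else
	/* Fallback: disk cache behavior here is platform-dependent. */
	if (fsync(fd) == -1)
		return errno;
#endif
	return 0;
}
```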

Documenting what currently happens is pointless, so in what I'm
writing I would like instead to document the model we'd like to have,
and for the time being append a large asterisk that says none of it
actually works.

My thoughts on the subject are:

(1) We should guarantee that any disk-level write error that doesn't
result in an immediate EIO return from write (and as I was saying
elsewhere, I don't think that will ever happen for regular files
without O_DIRECT) results in an eventual EIO return from fsync.

(2) Even when fsync ends up returning EIO it should sync as much of
the file as it can; that is, unlike typical error paths EIO should not
result in immediately stopping and unwinding.

(3) I think the drawbacks of reporting user 1's I/O errors to user 2
(especially when you can fsync after opening with O_RDONLY, so there
isn't necessarily a huge amount of trust involved) mean that we should
guarantee that I/O errors from *your* writes should be reported by
*your* call to fsync. I think it's sufficient to make this per-open
(rather than per-file-per-lwp or whatever) so the machinery can live
in struct file.

(3a) I don't think it's necessary to guarantee that I/O errors from
other people's writes won't _also_ be reported by your fsync call, but
I think any natural implementation that supports the prior guarantee
will also have this property.

(4) I'm not sure what to do about disk-level caches but I think
punting the issue to application software (especially via an obscure
and apparently not standard interface) isn't the right approach.

(5) I'm not sure if outstanding unreported I/O errors should cause
close to fail with EIO or if they should just be dropped. Given how
useless error handling on close is, it's probably a moot question.

(6) We do absolutely need to guarantee that when fsync returns
everything that process wrote is on disk, even if some other process
is in the middle of modifying some of the same buffers.

(7) I don't know if I/O errors should be counted or just latched (that
is, if there are six I/O errors, do you get six successive EIOs from
fsync, or just one?) but we should decide and commit to one model or
the other. Given per-open reporting I think latched is sufficient.

(8) I'm not convinced that there's any real value in reporting exactly
what blocks failed. Most applications have nothing useful they can do
with the information, and most of the rest (e.g. database engines)
will probably just shut down since in general you can't really

(9) We need a model for what happens to the unwritten data. Throwing
it away is clearly wrong (some may recall a furor a couple years ago
when it was discovered that Linux did this) but retrying and likely
failing on every subsequent fsync attempt isn't that useful either.
My suggestion is to allow retrying up to some arbitrary fixed number
of times and then mark the buffer broken, and provide some out-of-band
way to either discard everything (umount -f?) or start retrying again,
e.g. after manually reinserting accidentally ejected media.

(10) sync() cannot (and therefore shouldn't attempt to) report pending
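
Points (2), (7), and (9) can be illustrated together with a small
userspace model. Everything in it is hypothetical: struct model_buf,
the retry limit of 3, and the simulated "faulty" flag stand in for real
buffers and real media, and none of it is kernel code. It shows one
possible reading of the proposal: keep going after an error, latch a
single EIO per call, and stop retrying a buffer once it has failed a
fixed number of times.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_RETRIES 3	/* arbitrary fixed limit, as in point (9) */

struct model_buf {
	bool dirty;		/* holds data not yet on disk */
	bool faulty;		/* simulation only: writes to it fail */
	bool broken;		/* gave up; needs out-of-band action */
	unsigned attempts;	/* failed write attempts so far */
};

/* Simulated device write: returns 0 or EIO. */
static int
model_dev_write(const struct model_buf *b)
{
	return b->faulty ? EIO : 0;
}

/*
 * Try to write every dirty buffer. A failure does not stop the loop
 * (point 2). The result is latched, so the caller gets one EIO however
 * many buffers failed (point 7). A buffer that keeps failing is marked
 * broken and skipped on later calls (point 9).
 */
static int
model_fsync(struct model_buf *bufs, size_t n)
{
	int latched = 0;

	for (size_t i = 0; i < n; i++) {
		struct model_buf *b = &bufs[i];

		if (!b->dirty || b->broken)
			continue;
		if (model_dev_write(b) == 0) {
			b->dirty = false;
			b->attempts = 0;
			continue;
		}
		latched = EIO;
		if (++b->attempts >= MODEL_MAX_RETRIES)
			b->broken = true;
	}
	return latched;
}
```

In this model a broken buffer stays dirty and produces no further
errors until something clears its broken flag, which corresponds to the
out-of-band "start retrying again" action. Whether later fsync calls
should keep reporting EIO for such a buffer is a policy choice the
sketch does not settle.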

As far as implementing all this... right now one of the ways errors
are discarded is via the syncer, because it has nowhere to send them.
I think there needs to be some sort of slot or bucket to write I/O
errors into, and in light of the considerations above it should
probably _live_ in struct file and then be passed down so it can be
hung on all buffers it pertains to. I suspect (but I haven't checked
if there are any fatal gotchas) that this should entirely replace I/O
error reporting via EIO returns at the bufferio level, because the
latter is scattershot and basically doesn't really work for a lot of
common cases. It should also replace I/O error reporting via b_error,
I think, as while that's *supposed* to translate into eventual EIO
returns it doesn't actually work in practice. (I'm not sure offhand
what else can appear in that slot and whether this means it can go

On the UVM side, I'm afraid IDK. I only sort of understand
genfs_putpages, and not well enough to propose structural changes.
But I imagine something similar (ideally using the same reporting
structure) is probably desirable.

The structure I'm thinking of would be something like this:

   struct eiobucket {
           TAILQ_ENTRY(eiobucket) eb_listnode;
           unsigned eb_refcount;
           unsigned eb_value;
   };

where the listnode would be protected by the bufcache lock or maybe
the vnode interlock-lock, and the other fields would be handled with

(note that in common cases the value isn't touched and the refcount
and listnode are going to be all manipulated by the same one cpu, so
it's unlikely to become a concurrency bottleneck)

Unfortunately while it would be nice to have this _in_ struct file and
not a separate allocation, that won't work given revoke(). Hence the
refcount. The idea is to add it to a list on every buffer that gets
dirtied during a fs operation, and take it off again when the buffer's
successfully written.
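
The lifetime and latching rules can be tried out in userspace with a
sketch like the following. It is a model under stated assumptions, not
the proposed kernel code: the list node is left out, C11 atomics are
used for the two counters purely as one possible choice, and the
function names (eb_create, eb_hold, eb_rele, eb_post, eb_collect) are
all hypothetical.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Userspace model of the bucket, without the list linkage. */
struct eb_model {
	atomic_uint eb_refcount;	/* the open file plus each dirty buffer */
	atomic_uint eb_value;		/* 0, or a latched errno such as EIO */
};

/* Allocate a bucket holding one reference (for the open file). */
static struct eb_model *
eb_create(void)
{
	struct eb_model *eb = malloc(sizeof(*eb));

	if (eb == NULL)
		return NULL;
	atomic_init(&eb->eb_refcount, 1);
	atomic_init(&eb->eb_value, 0);
	return eb;
}

/* Take a reference, e.g. when a buffer is dirtied. */
static void
eb_hold(struct eb_model *eb)
{
	atomic_fetch_add(&eb->eb_refcount, 1);
}

/* Drop a reference; returns 1 if it was the last and eb was freed. */
static int
eb_rele(struct eb_model *eb)
{
	if (atomic_fetch_sub(&eb->eb_refcount, 1) == 1) {
		free(eb);
		return 1;
	}
	return 0;
}

/* Record a write failure; only the first error is latched. */
static void
eb_post(struct eb_model *eb, unsigned error)
{
	unsigned expected = 0;

	atomic_compare_exchange_strong(&eb->eb_value, &expected, error);
}

/* fsync side: return the latched error, if any, and clear it. */
static int
eb_collect(struct eb_model *eb)
{
	return (int)atomic_exchange(&eb->eb_value, 0);
}
```

Because each dirty buffer holds its own reference, the bucket can
outlive the struct file that created it, which is the case revoke()
forces. Clearing the value on collection is an assumption that matches
the latched model from point (7).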

Most of the other proposed guarantees are a matter of fixing up the
code paths that we have.

David A. Holland
