Subject: Re: [tls@rek.tjls.com: Question: various bugs in sync()?]
To: None <tls@rek.tjls.com>
From: Kirk McKusick <mckusick@McKusick.COM>
List: tech-kern
Date: 01/15/1999 11:56:50
While debugging something unrelated I believe I've found multiple bugs in
sync(). I would like to be persuaded that this is not so.
Bug #1: sync(2) returns before all data is, in fact, on disk.
I'm not sure this is a bug, but I'd like someone to explain to me
why it's not. sys_sync walks the list of mounted filesystems,
vfs_busy()ing each and then calling VFS_SYNC(mp, MNT_NOWAIT, ...)
(vfs_syscalls.c line 528) on each.
Taking FFS as an example, VFS_SYNC() is ffs_sync, which walks the
list of vnodes for the filesystem and VOP_FSYNC's each; MNT_NOWAIT
implies !FSYNC_WAIT on the VOP_FSYNC. (ffs_vfsops.c line 822).
VOP_FSYNC for FFS is genfs_fsync. genfs_fsync calls vflushbuf()
with a "sync" argument from the presence/absence of FSYNC_WAIT
(genfs_vnops.c line 87). vflushbuf() walks the list of dirty
buffers for the vnode, scheduling asynchronous writes with
bawrite() because "sync" isn't set. (vfs_subr.c, line 605).
Conclusion: sync() returns once I/O for all delayed writes has
been scheduled, but not completed. That is, it converts
delayed writes into asynchronous writes, which usually
complete after it exits.
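
For concreteness, here's a rough userland illustration of that
behaviour (not the kernel path itself; the file name, sizes and
exact timings are arbitrary): sync(2) comes back almost immediately
after scheduling the writes, while fsync(2) on the descriptor has
to wait for the data to actually reach the disk.

/*
 * Userland sketch only: queue a pile of buffered writes, then
 * compare how long sync(2) and fsync(2) take.  sync() just
 * schedules the I/O and returns; fsync() waits for this file's
 * buffers to hit the disk.
 */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static double
seconds(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

int
main(void)
{
    char buf[65536];
    struct timeval t0, t1;
    int i, fd;

    fd = open("syncdemo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    for (i = 0; i < 256; i++)           /* ~16MB of dirty buffers */
        if (write(fd, buf, sizeof(buf)) == -1) {
            perror("write");
            return 1;
        }

    gettimeofday(&t0, NULL);
    sync();                             /* schedules async writes, returns */
    gettimeofday(&t1, NULL);
    printf("sync():  %.3f s\n", seconds(&t0, &t1));

    gettimeofday(&t0, NULL);
    if (fsync(fd) == -1)                /* waits for the data to land */
        perror("fsync");
    gettimeofday(&t1, NULL);
    printf("fsync(): %.3f s\n", seconds(&t0, &t1));

    close(fd);
    return 0;
}
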
I guess this is the traditional semantic. However, it
doesn't agree with what the sync(8) manual page says
happens, and the behaviour is mentioned as a bug in the
sync(2) manual page. What do relevant standards require?
I realize that changing the MNT_NOWAIT to MNT_WAIT would
probably dramatically decrease performance.
You are correct that the sync system call returns before all the
data is on the disk. This is historic practice, as in the old days
there was no way to wait for async I/O to finish and doing everything
synchronously (with bwrite) would have been really slow. I had thought
of adding a flags parameter to sync to let the user specify that they
wanted to wait for the I/O to complete, but never did so. As you note,
the code is all there to do it, there is just no way to pass that
information in from the sync system call. This semantic is the reason
for the folk wisdom that you should always type sync three times before
hitting the reboot button. The first sync was to get the I/O in the
disk queue. The next two were just to delay the human long enough for
the I/O to finish. If you want to make a change here, I would recommend
that you create a new system call for sync that takes a flags parameter
to specify sync or async behavior. This choice can then be reflected
back to a -s (sync) flag to the sync command.
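
To make that concrete, here is a strawman of the userland half; sync2()
and SYNC_WAIT are invented names (no such call or flag exists today),
and the kernel half would just be the existing sys_sync loop handing
MNT_WAIT rather than MNT_NOWAIT to VFS_SYNC when the flag is set:

#include <string.h>

#define SYNC_WAIT 0x0001        /* hypothetical: wait for the I/O */

int sync2(int flags);           /* hypothetical system call */

/*
 * Strawman sync(8): plain "sync" keeps the historic schedule-and-return
 * behaviour; "sync -s" asks the kernel to wait until the scheduled
 * writes have actually completed.
 */
int
main(int argc, char *argv[])
{
    int flags = 0;

    if (argc > 1 && strcmp(argv[1], "-s") == 0)
        flags |= SYNC_WAIT;
    return (sync2(flags) == -1);
}
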
Bug #2: data for block devices without mounted filesystems is not
flushed by sync(2).
Because sys_sync walks the list of mounted filesystems,
data for block devices is not sync()ed.
There are two sub-cases here.
Case 1: block devices accessed with write()
In this case, I don't know if data is flushed or not. We
walk the list of mounted filesystems, flushing data
for all their vnodes. This percolates down as given above.
But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
catch the vnode for the block device? I guess it depends on
whether that device vnode is on the mount-point's list of
vnodes. If it is, it should, I *think*, get written --
but then I don't understand why filesystems beneath the
root don't get written multiple times. Someone, please
help me understand this!
In any event, if the filesystem the block device is on
is itself mounted read-only, the block device definitely
doesn't get flushed because of the check for MNT_RDONLY
(vfs_syscalls.c line 520). So we can definitely lose
this way.
The block devices will be found in the vnode list of the filesystem
their device nodes live on (usually the root filesystem). You are
correct that they will be missed if that filesystem is mounted
read-only. The check for skipping read-only filesystems in sync should
be removed so that they will be found.
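
To illustrate where that check bites, here is a self-contained toy
model (stand-in structures, not vfs_syscalls.c): because the
MNT_RDONLY test guards the whole per-mount sync, dirty buffers on
device vnodes living in a read-only filesystem are never pushed out,
and dropping the test lets VFS_SYNC find them.

/*
 * Toy model only -- minimal stand-ins, not the kernel code.  The point
 * is just where the read-only test sits: it skips the entire mount,
 * device vnodes and all.
 */
#include <stdio.h>

#define MNT_RDONLY 0x0001       /* same meaning as the real flag */

struct mount {
    const char *mnt_path;
    int         mnt_flag;
};

static void
sync_one_mount(struct mount *mp)
{
    if (mp->mnt_flag & MNT_RDONLY) {    /* <- the check to remove */
        printf("%s: skipped, dirty device buffers left behind\n",
            mp->mnt_path);
        return;
    }
    printf("%s: VFS_SYNC(mp, MNT_NOWAIT, ...)\n", mp->mnt_path);
}

int
main(void)
{
    struct mount mounts[] = {
        { "/",    MNT_RDONLY },         /* root (and /dev) mounted ro */
        { "/usr", 0 },
    };
    int i;

    for (i = 0; i < 2; i++)
        sync_one_mount(&mounts[i]);
    return 0;
}
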
Case 2: block devices accessed with mmap()
mmap()ed data is flushed by vnode_pager_sync(mp) or by
uvm_vnp_sync(mp). These walk the list of uvn's (UVM)
or vm_objects (Mach VM), checking the mount point of
each corresponding vnode and flushing all dirty pages
for those which match the given mp. (uvm_vnode.c line
1984). If I'm correct that vp->v_mount for a device
node is in fact the filesystem the device node lives
in (and not NULL or something) then *usually* data
gets synced this way. However, we still lose for
device nodes that live on read-only filesystems.
I'm pretty sure I know how to fix this. I can just
change the semantics of uvm_vnp_sync()/vnode_pager_sync()
to remove the "mp" argument (and comparison), and move the
call outside the per-mount-point loop. I don't *think* this
needs to be protected with vfs_busy.
If it does, I propose to vfs_busy all filesystems, call
uvm_vnp_sync/vnode_pager_sync with the new interface, and
then either (a) vfs_unbusy them all and iterate over them
vfs_busy-ing, syncing, and un-busying as before, or (b)
leave them all vfs_busied, VFS_SYNC them all, and then
vfs_unbusy them all.  I'd like suggestions on which
approach is better, as well as on whether I need to
protect the uvm_vnp_sync/vnode_pager_sync call with
vfs_busy at all.  (I think I do, since it protects
against unmounting while the sync is running.)
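
In case it helps, here is a toy model of the filtering I'm talking
about (stand-in types, not uvm_vnode.c): with the current interface
the caller passes a mount point and objects on other filesystems are
skipped; with the proposed one there is nothing to compare against,
so a single call flushes every dirty vnode-backed object, device
nodes on read-only filesystems included.

/*
 * Toy model only -- stand-in types, not the uvm/Mach pager code.
 * pager_sync(objs, n, mp) mimics the current per-mount filtering;
 * pager_sync(objs, n, NULL) models the proposed "flush everything
 * in one pass" semantics.
 */
#include <stddef.h>

struct mount   { int unused; };
struct vnode   { struct mount *v_mount; };
struct vobject { struct vnode *vp; int dirty; };

static void
pager_sync(struct vobject *objs, int nobjs, struct mount *mp)
{
    int i;

    for (i = 0; i < nobjs; i++) {
        if (mp != NULL && objs[i].vp->v_mount != mp)
            continue;           /* current: other filesystem, skip */
        objs[i].dirty = 0;      /* "flush" the object's dirty pages */
    }
}

int
main(void)
{
    struct mount fs_a, fs_b;
    struct vnode va = { &fs_a }, vb = { &fs_b };
    struct vobject objs[] = { { &va, 1 }, { &vb, 1 } };

    pager_sync(objs, 2, &fs_a); /* current: only fs_a's object cleaned */
    pager_sync(objs, 2, NULL);  /* proposed: everything cleaned */
    return 0;
}
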
Assuming that you fix the read-only filesystem problem in sync,
I do not see any need to do something special for mmap'ed block
devices as they will be found in the normal course of doing the
sync.
I'm still quite curious about the write() case: are device nodes'
buffers flushed when the filesystems they live on are flushed?  If
they are, I can't see why filesystems below the root aren't flushed
twice -- once from the mount list, and once when their device's
vnode is flushed as part of the filesystem it lives in.
I'm hoping someone more familiar with the VFS code and buffer cache
can remove some of my Deep Confusion here.
--
Thor Lancelot Simon tls@rek.tjls.com
"And where do all these highways go, now that we are free?"
----- End forwarded message -----