Subject: Question: various bugs in sync()?
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 01/15/1999 03:35:39
While debugging something unrelated I believe I've found multiple bugs in
sync(). I would like to be persuaded that this is not so.
Bug #1: sync(2) returns before all data is, in fact, on disk.
I'm not sure this is a bug, but I'd like someone to explain to me
why it's not. sys_sync walks the list of mounted filesystems,
vfs_busy()ing each one and then calling VFS_SYNC(mp, MNT_NOWAIT, ...)
on it (vfs_syscalls.c line 528).
Taking FFS as an example, VFS_SYNC() is ffs_sync, which walks the
list of vnodes for the filesystem and VOP_FSYNC's each; MNT_NOWAIT
implies !FSYNC_WAIT on the VOP_FSYNC. (ffs_vfsops.c line 822).
VOP_FSYNC for FFS is genfs_fsync. genfs_fsync calls vflushbuf()
with a "sync" argument derived from the presence or absence of
FSYNC_WAIT (genfs_vnops.c line 87). vflushbuf() walks the list of
dirty buffers for the vnode, scheduling asynchronous writes with
bawrite() because "sync" isn't set. (vfs_subr.c line 605).
Conclusion: sync() returns once I/O for all delayed writes has
been scheduled, but not completed. That is, it converts
delayed writes into asynchronous writes, which usually
complete after it exits.
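
In case it helps anyone check my reading of the above, here's a tiny
stand-alone model of the decision vflushbuf() ends up making. The
buffers and the write routines are just stubs that print what would
happen, but it shows where the MNT_NOWAIT path stops waiting:

/* Simplified stand-alone model -- not the real kernel code. */
#include <stdio.h>

struct buf {
        int b_id;
        int b_dirty;
};

/* stub: asynchronous write -- schedule the I/O and return at once */
static void
bawrite_stub(struct buf *bp)
{
        printf("bawrite: buf %d queued, caller does not wait\n", bp->b_id);
}

/* stub: synchronous write -- do the I/O and wait for completion */
static void
bwrite_stub(struct buf *bp)
{
        printf("bwrite: buf %d written, completion awaited\n", bp->b_id);
}

/* vflushbuf()-shaped walk over a vnode's dirty buffer list */
static void
vflushbuf_model(struct buf *bufs, int n, int sync)
{
        int i;

        for (i = 0; i < n; i++) {
                if (!bufs[i].b_dirty)
                        continue;
                if (sync)
                        bwrite_stub(&bufs[i]);  /* FSYNC_WAIT case */
                else
                        bawrite_stub(&bufs[i]); /* MNT_NOWAIT case */
                bufs[i].b_dirty = 0;
        }
}

int
main(void)
{
        struct buf bufs[3] = { { 0, 1 }, { 1, 1 }, { 2, 1 } };

        /*
         * sys_sync -> VFS_SYNC(mp, MNT_NOWAIT) -> VOP_FSYNC without
         * FSYNC_WAIT -> vflushbuf(vp, 0): every delayed write becomes
         * an asynchronous write, and nothing below waits for it.
         */
        vflushbuf_model(bufs, 3, 0);
        return 0;
}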
I guess this is the traditional semantic. However, it
doesn't agree with what the sync(8) manual page says
happens, and the behaviour is mentioned as a bug in the
sync(2) manual page. What do relevant standards require?
I realize that changing the MNT_NOWAIT to MNT_WAIT would
probably dramatically decrease performance.
Bug #2: data for block devices without mounted filesystems is not
flushed by sync(2).
Because sys_sync only walks the list of mounted filesystems,
data for block devices is never sync()ed directly.
There are two sub-cases here.
Case 1: block devices accessed with write()
In this case, I don't know if data is flushed or not. We
walk the list of mounted filesystems, flushing data
for all their vnodes. This percolates down as given above.
But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
catch the vnode for the block device? I guess it depends on
whether that device vnode is on the mount point's list of
vnodes. If it is, it should, I *think*, get written --
but then I don't understand why filesystems beneath the
root don't get written multiple times. Someone, please
help me understand this!
In any event, if the filesystem the block device is on
is itself mounted read-only, the block device definitely
doesn't get flushed because of the check for MNT_RDONLY
(vfs_syscalls.c line 520). So we can definitely lose
this way.
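
For what it's worth, the workaround from userland seems clear
enough: if sync(2) can't be trusted to flush a device's buffers,
fsync(2) on the device descriptor itself should do it, since (as far
as I can tell) that ends up in VOP_FSYNC with FSYNC_WAIT set. A
trivial illustration (it rewrites the device's first block with its
own contents just to dirty a buffer, and the device path is only a
placeholder):

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
        const char *dev = "/dev/vnd0c"; /* placeholder device path */
        char block[512];
        int fd;

        if ((fd = open(dev, O_RDWR)) == -1)
                err(1, "open %s", dev);

        /*
         * Read the first block and write it straight back, just to
         * dirty a buffer attached to the device vnode.
         */
        if (read(fd, block, sizeof(block)) != (ssize_t)sizeof(block))
                err(1, "read");
        if (lseek(fd, (off_t)0, SEEK_SET) == -1)
                err(1, "lseek");
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block))
                err(1, "write");

        /*
         * sync(2) would only schedule this buffer (and, per the cases
         * above, may skip the device vnode entirely); fsync(2) on the
         * device descriptor waits for the data to reach the disk.
         */
        if (fsync(fd) == -1)
                err(1, "fsync");
        close(fd);
        return 0;
}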
Case 2: block devices accessed with mmap()
mmap()ed data is flushed by vnode_pager_sync(mp) or by
uvm_vnp_sync(mp). These walk the list of uvn's (UVM)
or vm_objects (Mach VM), checking the mount point of
each corresponding vnode and flushing all dirty pages
for those which match the given mp. (uvm_vnode.c line
1984). If I'm correct that vp->v_mount for a device
node is in fact the filesystem the device node lives
in (and not NULL or something) then *usually* data
gets synced this way. However, we still lose for
device nodes that live on read-only filesystems.
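
In other words, as I read it the page-flush pass filters on the
mount point, roughly like the stand-alone model below (my
paraphrase, not the actual uvm_vnode.c code):

#include <stdio.h>

struct mount { int m_id; };
struct vnode { struct mount *v_mount; int v_id; };
struct uvn   { struct vnode *u_vnode; int u_dirty; };

/* uvm_vnp_sync()-shaped walk: only vnodes on the given mount are flushed */
static void
uvm_vnp_sync_model(struct uvn *uvns, int n, struct mount *mp)
{
        int i;

        for (i = 0; i < n; i++) {
                if (uvns[i].u_vnode->v_mount != mp)     /* per-mount filter */
                        continue;
                if (uvns[i].u_dirty)
                        printf("flush dirty pages of vnode %d\n",
                            uvns[i].u_vnode->v_id);
        }
}

int
main(void)
{
        struct mount rw = { 0 }, ro = { 1 };
        struct vnode file = { &rw, 10 }, dev = { &ro, 20 };
        struct uvn uvns[2] = { { &file, 1 }, { &dev, 1 } };

        /*
         * sys_sync skips "ro" because of MNT_RDONLY, so it only ever
         * calls this with mp == &rw: the dirty pages behind the device
         * vnode on the read-only filesystem never get flushed.
         */
        uvm_vnp_sync_model(uvns, 2, &rw);
        return 0;
}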
I'm pretty sure I know how to fix this. I can just
change the semantics of uvm_vnp_sync()/vnode_pager_sync()
to remove the "mp" argument (and comparison), and move the
call outside the per-mount-point loop. I don't *think* this
needs to be protected with vfs_busy.
If it does, I propose to vfs_busy all filesystems and then call
uvm_vnp_sync/vnode_pager_sync with the new interface. From there
I see two options: either vfs_unbusy all filesystems and iterate
over them vfs_busy-ing, sync-ing, and un-busying as before, or
leave them all vfs_busied, VFS_SYNC them all, then vfs_unbusy
them all. I'd like suggestions on which approach is better, as
well as on whether I need to bother protecting the
uvm_vnp_sync/vnode_pager_sync call with vfs_busy at all. (I
think I do, since it protects against unmounting while the sync
is running.)
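
To make the second variant concrete, the ordering I have in mind is
roughly the following; everything here is a print stub, and I'm
hand-waving the MNT_RDONLY check and the mountlist locking:

#include <stdio.h>

#define NMOUNTS 3

static void vfs_busy_stub(int mp)   { printf("vfs_busy(mp%d)\n", mp); }
static void vfs_unbusy_stub(int mp) { printf("vfs_unbusy(mp%d)\n", mp); }
static void VFS_SYNC_stub(int mp)   { printf("VFS_SYNC(mp%d, MNT_NOWAIT)\n", mp); }

/* proposed interface: no mp argument, flush every dirty page in the system */
static void uvm_vnp_sync_all(void)  { printf("uvm_vnp_sync(): flush all dirty pages\n"); }

int
main(void)
{
        int mp;

        /* 1. busy every mount so nothing is unmounted underneath us */
        for (mp = 0; mp < NMOUNTS; mp++)
                vfs_busy_stub(mp);

        /* 2. one global page flush, hoisted out of the per-mount loop */
        uvm_vnp_sync_all();

        /* 3. per-mount buffer sync, much as sys_sync does today */
        for (mp = 0; mp < NMOUNTS; mp++)
                VFS_SYNC_stub(mp);

        /* 4. release everything */
        for (mp = 0; mp < NMOUNTS; mp++)
                vfs_unbusy_stub(mp);

        return 0;
}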
I'm quite curious about the write() case, and about whether device
nodes' buffers are flushed when the filesystems they live on are
flushed; if they are, I can't see why filesystems below the root
aren't flushed multiple times: once from the mount list and once
when their device's vnode is flushed as part of the filesystem it
lives in.
I'm hoping someone more familiar with the VFS code and buffer cache
can remove some of my Deep Confusion here.
--
Thor Lancelot Simon tls@rek.tjls.com
"And where do all these highways go, now that we are free?"