Subject: Re: [tls@rek.tjls.com: Question: various bugs in sync()?]
To: None <tls@rek.tjls.com>
From: Kirk McKusick <mckusick@McKusick.COM>
List: tech-kern
Date: 01/15/1999 11:56:50
While debugging something unrelated I believe I've found multiple bugs in
sync(). I would like to be persuaded that this is not so.
Bug #1: sync(2) returns before all data is, in fact, on disk.
I'm not sure this is a bug, but I'd like someone to explain to me
why it's not. sys_sync walks the list of mounted filesystems,
vfs_busy()ing each and then calling VFS_SYNC(mp, MNT_NOWAIT, ...)
(vfs_syscalls.c line 528) on each.
Taking FFS as an example, VFS_SYNC() is ffs_sync, which walks the
list of vnodes for the filesystem and VOP_FSYNC's each; MNT_NOWAIT
implies !FSYNC_WAIT on the VOP_FSYNC. (ffs_vfsops.c line 822).
VOP_FSYNC for FFS is genfs_fsync. genfs_fsync calls vflushbuf()
with a "sync" argument from the presence/absence of FSYNC_WAIT
(genfs_vnops.c line 87). vflushbuf() walks the list of dirty
buffers for the vnode, scheduling asynchronous writes with
bawrite() because "sync" isn't set. (vfs_subr.c, line 605).
Conclusion: sync() returns once I/O for all delayed writes has
been scheduled, but not completed. That is, it converts
delayed writes into asynchronous writes, which usually
complete after it exits.
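
For concreteness, here's a rough userland illustration of that
behaviour (not the kernel path itself; the file name, sizes and
exact timings are arbitrary): sync(2) comes back almost immediately
after scheduling the writes, while fsync(2) on the descriptor has
to wait for the data to actually reach the disk.

/*
 * Userland sketch only: queue a pile of buffered writes, then
 * compare how long sync(2) and fsync(2) take.  sync() just
 * schedules the I/O and returns; fsync() waits for this file's
 * buffers to hit the disk.
 */
#include <sys/time.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static double
seconds(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

int
main(void)
{
    char buf[65536];
    struct timeval t0, t1;
    int i, fd;

    fd = open("syncdemo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    for (i = 0; i < 256; i++)           /* ~16MB of dirty buffers */
        if (write(fd, buf, sizeof(buf)) == -1) {
            perror("write");
            return 1;
        }

    gettimeofday(&t0, NULL);
    sync();                             /* schedules async writes, returns */
    gettimeofday(&t1, NULL);
    printf("sync():  %.3f s\n", seconds(&t0, &t1));

    gettimeofday(&t0, NULL);
    if (fsync(fd) == -1)                /* waits for the data to land */
        perror("fsync");
    gettimeofday(&t1, NULL);
    printf("fsync(): %.3f s\n", seconds(&t0, &t1));

    close(fd);
    return 0;
}
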
I guess this is the traditional semantic. However, it
doesn't agree with what the sync(8) manual page says
happens, and the behaviour is mentioned as a bug in the
sync(2) manual page. What do relevant standards require?
I realize that changing the MNT_NOWAIT to MNT_WAIT would
probably dramatically decrease performance.
You are correct that the sync system call returns before all the
data is on the disk. This is historic practice, as in the old days
there was no way to wait for async I/O to finish and doing everything
synchronously (with bwrite) would have been really slow. I had thought
of adding a flags parameter to sync to let the user specify that they
wanted to wait for the I/O to complete, but never did so. As you note,
the code is all there to do it, there is just no way to pass that
information in from the sync system call. This semantic is the reason
for the folk wisdom that you should always type sync three times before
hitting the reboot button. The first sync was to get the I/O in the
disk queue. The next two were just to delay the human long enough for
the I/O to finish. If you want to make a change here, I would recommend
that you create a new system call for sync that takes a flags parameter
to specify sync or async behavior. This choice can then be reflected
back to a -s (sync) flag to the sync command.
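
To make that concrete, here is a strawman of the userland half; sync2()
and SYNC_WAIT are invented names (no such call or flag exists today),
and the kernel half would just be the existing sys_sync loop handing
MNT_WAIT rather than MNT_NOWAIT to VFS_SYNC when the flag is set:

#include <string.h>

#define SYNC_WAIT 0x0001        /* hypothetical: wait for the I/O */

int sync2(int flags);           /* hypothetical system call */

/*
 * Strawman sync(8): plain "sync" keeps the historic schedule-and-return
 * behaviour; "sync -s" asks the kernel to wait until the scheduled
 * writes have actually completed.
 */
int
main(int argc, char *argv[])
{
    int flags = 0;

    if (argc > 1 && strcmp(argv[1], "-s") == 0)
        flags |= SYNC_WAIT;
    return (sync2(flags) == -1);
}
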
Bug #2: data for block devices without mounted filesystems is not
flushed by sync(2).
Because sys_sync walks the list of mounted filesystems,
data for block devices is not sync()ed.
There are two sub-cases here.
Case 1: block devices accessed with write()
In this case, I don't know if data is flushed or not. We
walk the list of mounted filesystems, flushing data
for all their vnodes. This percolates down as given above.
But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
catch the vnode for the block device? I guess it depends on
whether that device vnode is on the mount-point's list of
vnodes. If it is, it should, I *think*, get written --
but then I don't understand why filesystems beneath the
root don't get written multiple times. Someone, please
help me understand this!
In any event, if the filesystem the block device is on
is itself mounted read-only, the block device definitely
doesn't get flushed because of the check for MNT_RDONLY
(vfs_syscalls.c line 520). So we can definitely lose
this way.
The block devices will be found in the vnode list of the filesystem
their device nodes live on (usually the root filesystem). You are
correct that they will be missed if that filesystem is mounted
read-only. The check for skipping read-only filesystems in sync should
be removed so that they will be found.
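
To illustrate where that check bites, here is a self-contained toy
model (stand-in structures, not vfs_syscalls.c): because the
MNT_RDONLY test guards the whole per-mount sync, dirty buffers on
device vnodes living in a read-only filesystem are never pushed out,
and dropping the test lets VFS_SYNC find them.

/*
 * Toy model only -- minimal stand-ins, not the kernel code.  The point
 * is just where the read-only test sits: it skips the entire mount,
 * device vnodes and all.
 */
#include <stdio.h>

#define MNT_RDONLY 0x0001       /* same meaning as the real flag */

struct mount {
    const char *mnt_path;
    int         mnt_flag;
};

static void
sync_one_mount(struct mount *mp)
{
    if (mp->mnt_flag & MNT_RDONLY) {    /* <- the check to remove */
        printf("%s: skipped, dirty device buffers left behind\n",
            mp->mnt_path);
        return;
    }
    printf("%s: VFS_SYNC(mp, MNT_NOWAIT, ...)\n", mp->mnt_path);
}

int
main(void)
{
    struct mount mounts[] = {
        { "/",    MNT_RDONLY },         /* root (and /dev) mounted ro */
        { "/usr", 0 },
    };
    int i;

    for (i = 0; i < 2; i++)
        sync_one_mount(&mounts[i]);
    return 0;
}
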
Case 2: block devices accessed with mmap()
mmap()ed data is flushed by vnode_pager_sync(mp) or by
uvm_vnp_sync(mp). These walk the list of uvn's (UVM)
or vm_objects (Mach VM), checking the mount point of
each corresponding vnode and flushing all dirty pages
for those which match the given mp. (uvm_vnode.c line
1984). If I'm correct that vp->v_mount for a device
node is in fact the filesystem the device node lives
in (and not NULL or something) then *usually* data
gets synced this way. However, we still lose for
device nodes that live on read-only filesystems.
I'm pretty sure I know how to fix this. I can just
change the semantics of uvm_vnp_sync()/vnode_pager_sync()
to remove the "mp" argument (and comparison), and move the
call outside the per-mount-point loop. I don't *think* this
needs to be protected with vfs_busy.
If it does, I propose to vfs_busy all filesystems, call
uvm_vnp_sync/vnode_pager_sync with the new interface, and
then either (a) vfs_unbusy them all and iterate over them
vfs_busy-ing, syncing, and un-busying as before, or (b)
leave them all vfs_busied, VFS_SYNC them all, and then
vfs_unbusy them all.  I'd like suggestions on which
approach is better, as well as on whether I need to
protect the uvm_vnp_sync/vnode_pager_sync call with
vfs_busy at all.  (I think I do, since it protects
against unmounting while the sync is running.)
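
In case it helps, here is a toy model of the filtering I'm talking
about (stand-in types, not uvm_vnode.c): with the current interface
the caller passes a mount point and objects on other filesystems are
skipped; with the proposed one there is nothing to compare against,
so a single call flushes every dirty vnode-backed object, device
nodes on read-only filesystems included.

/*
 * Toy model only -- stand-in types, not the uvm/Mach pager code.
 * pager_sync(objs, n, mp) mimics the current per-mount filtering;
 * pager_sync(objs, n, NULL) models the proposed "flush everything
 * in one pass" semantics.
 */
#include <stddef.h>

struct mount   { int unused; };
struct vnode   { struct mount *v_mount; };
struct vobject { struct vnode *vp; int dirty; };

static void
pager_sync(struct vobject *objs, int nobjs, struct mount *mp)
{
    int i;

    for (i = 0; i < nobjs; i++) {
        if (mp != NULL && objs[i].vp->v_mount != mp)
            continue;           /* current: other filesystem, skip */
        objs[i].dirty = 0;      /* "flush" the object's dirty pages */
    }
}

int
main(void)
{
    struct mount fs_a, fs_b;
    struct vnode va = { &fs_a }, vb = { &fs_b };
    struct vobject objs[] = { { &va, 1 }, { &vb, 1 } };

    pager_sync(objs, 2, &fs_a); /* current: only fs_a's object cleaned */
    pager_sync(objs, 2, NULL);  /* proposed: everything cleaned */
    return 0;
}
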
Assuming that you fix the read-only filesystem problem in sync,
I do not see any need to do something special for mmap'ed block
devices as they will be found in the normal course of doing the
sync.
I'm still quite curious about the write() case: are device nodes'
buffers flushed when the filesystems they live on are flushed?  If
they are, I can't see why filesystems below the root aren't flushed
twice -- once from the mount list, and once when their device's
vnode is flushed as part of the filesystem it lives in.
I'm hoping someone more familiar with the VFS code and buffer cache
can remove some of my Deep Confusion here.
--
Thor Lancelot Simon tls@rek.tjls.com
"And where do all these highways go, now that we are free?"
----- End forwarded message -----