Subject: Question: various bugs in sync()?
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 01/15/1999 03:35:39
While debugging something unrelated, I believe I've found multiple bugs
in sync().  I would like to be persuaded that this is not so.

Bug #1: sync(2) returns before all data is, in fact, on disk.

	I'm not sure this is a bug, but I'd like someone to explain to me
	why it's not.  sys_sync walks the list of mounted filesystems,
	vfs_busy()ing each one and calling VFS_SYNC(mp, MNT_NOWAIT, ...)
	on it (vfs_syscalls.c line 528).
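
	To make the flow concrete, here is roughly how I read that loop
	(pseudocode, not the literal source; argument lists and locking
	are elided):

		/*
		 * Rough pseudocode sketch of the sys_sync loop;
		 * arguments and locking elided.
		 */
		for (each mp on the mount list) {
			if (vfs_busy(mp, ...))
				continue;		/* e.g. being unmounted */
			if ((mp->mnt_flag & MNT_RDONLY) == 0)
				VFS_SYNC(mp, MNT_NOWAIT, ...);	/* schedule only */
			vfs_unbusy(mp, ...);
		}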

	Taking FFS as an example, VFS_SYNC() is ffs_sync, which walks the
	list of vnodes for the filesystem and VOP_FSYNC's each; MNT_NOWAIT
	implies !FSYNC_WAIT on the VOP_FSYNC (ffs_vfsops.c line 822).
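
	Again roughly, the relevant part of ffs_sync as I understand it
	(pseudocode; "waitfor" stands for the MNT_WAIT/MNT_NOWAIT argument
	passed down from sys_sync):

		/* Pseudocode sketch of ffs_sync's vnode walk. */
		for (each vp on the mount's vnode list) {
			if (vp has no dirty buffers)
				continue;
			VOP_FSYNC(vp, cred,
			    (waitfor == MNT_WAIT) ? FSYNC_WAIT : 0, ...);
		}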

	VOP_FSYNC for FFS is genfs_fsync.  genfs_fsync calls vflushbuf()
	with a "sync" argument derived from the presence or absence of
	FSYNC_WAIT (genfs_vnops.c line 87).  vflushbuf() walks the list
	of dirty buffers for the vnode, scheduling asynchronous writes
	with bawrite() because "sync" isn't set (vfs_subr.c line 605).
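
	The tail of the chain, as I read it (pseudocode; the real
	vflushbuf() also deals with busy buffers, splbio, and retries):

		/* Pseudocode sketch of vflushbuf(vp, sync). */
		for (each bp on vp's dirty buffer list)
			bawrite(bp);	/* start the I/O, don't wait for it */
		if (sync)
			wait for the vnode's pending writes to finish;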

	Conclusion: sync() returns once I/O for all delayed writes has
	been scheduled, but not completed.  That is, it converts
	delayed writes into asynchronous writes, which usually
	complete after it exits.

	I guess these are the traditional semantics.  However, they
	don't agree with what the sync(8) manual page says happens,
	and the behaviour is mentioned as a bug in the sync(2) manual
	page.  What do the relevant standards require?

	I realize that changing the MNT_NOWAIT to MNT_WAIT would
	probably dramatically decrease performance.

Bug #2: data for block devices without mounted filesystems is not
	flushed by sync(2).

	Because sys_sync only walks the list of mounted filesystems,
	data for block devices that have no filesystem mounted on them
	is not sync()ed.

	There are two sub-cases here.

	Case 1: block devices accessed with write()

	In this case, I don't know if data is flushed or not.  We
	walk the list of mounted filesystems, flushing data
	for all their vnodes.  This percolates down as given above.
	But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
	catch the vnode for the block device?  I guess it depends on
	whether that device vnode is on the mount-point's list of
	vnodes.  If it is, it should, I *think*, get written --
	but then I don't understand why filesystems beneath the
	root don't get written multiple times.  Someone, please
	help me understand this!

	In any event, if the filesystem the block device is on
	is itself mounted read-only, the block device definitely
	doesn't get flushed because of the check for MNT_RDONLY
	(vfs_syscalls.c line 520).  So we can definitely lose
	this way.

	Case 2: block devices accessed with mmap()

	mmap()ed data is flushed by vnode_pager_sync(mp) or by
	uvm_vnp_sync(mp).  These walk the list of uvn's (UVM)
	or vm_objects (Mach VM), checking the mount point of
	each corresponding vnode and flushing all dirty pages
	for those which match the given mp. (uvm_vnode.c line
	1984).  If I'm correct that vp->v_mount for a device
	node is in fact the filesystem the device node lives
	in (and not NULL or something) then *usually* data
	gets synced this way.  However, we still lose for
	device nodes that live on read-only filesystems.
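
	In pseudocode, my reading of that per-mount filter (names
	simplified, details elided):

		/* Pseudocode sketch of the mp filter in uvm_vnp_sync(mp). */
		for (each uvn/vm_object backed by a vnode vp) {
			if (vp->v_mount != mp)
				continue;	/* not on this mount: skipped */
			flush the object's dirty pages to vp;
		}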

	I'm pretty sure I know how to fix this.  I can just
	change the semantics of uvm_vnp_sync()/vnode_pager_sync()
	to remove the "mp" argument (and comparison), and move the
	call outside the per-mount-point loop.  I don't *think* this
	needs to be protected with vfs_busy.
	
	If it does, I propose to vfs_busy all filesystems, call
	uvm_vnp_sync/vnode_pager_sync with the new interface, and then
	either (a) vfs_unbusy all filesystems and iterate over them
	vfs_busy-ing, VFS_SYNC-ing, and vfs_unbusy-ing as before, or
	(b) leave them all vfs_busied, VFS_SYNC them all, then
	vfs_unbusy them all.  I'd like suggestions on which approach
	is better, as well as whether or not I need to protect the
	uvm_vnp_sync/vnode_pager_sync call with vfs_busy at all.
	(I think I do, since it protects against unmounting while
	the sync is running.)
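
	A rough sketch of what option (b) would look like (pseudocode,
	details elided):

		/*
		 * Pseudocode sketch of the proposed sys_sync shape,
		 * option (b): flush the page cache once, up front.
		 */
		for (each mp on the mount list)
			vfs_busy(mp, ...);
		uvm_vnp_sync();		/* new interface: no mp argument */
		for (each mp on the mount list) {
			if ((mp->mnt_flag & MNT_RDONLY) == 0)
				VFS_SYNC(mp, MNT_NOWAIT, ...);
			vfs_unbusy(mp, ...);
		}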

I'm quite curious about the write() case, i.e. whether device nodes'
buffers are flushed when the filesystems they live on are flushed.
If they are, I can't see why filesystems below the root aren't
flushed multiple times: once from the mount list, and once when
their device's vnode is flushed as part of the filesystem it lives
in.

I'm hoping someone more familiar with the VFS code and buffer cache
can remove some of my Deep Confusion here.

-- 
Thor Lancelot Simon	                                      tls@rek.tjls.com
	"And where do all these highways go, now that we are free?"