Subject: Re: Question: various bugs in sync()?
To: None <tech-kern@netbsd.org>
From: Thor Lancelot Simon <tls@rek.tjls.com>
List: tech-kern
Date: 01/15/1999 16:10:56
On Fri, Jan 15, 1999 at 03:35:39AM -0500, Thor Lancelot Simon wrote:

I am really *much* more concerned about the following bug, in which data may
never be scheduled to be written, period.

I'm hoping someone can answer some of the questions I couldn't figure out
for myself last night, particularly the no-sync/double-sync question about
block devices' vnodes.

> Bug #2: data for block devices without mounted filesystems is not
> 	flushed by sync(2).
> 
> 	Because sys_sync walks the list of mounted filesystems,
> 	data for block devices is not sync()ed.
> 
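> 	For reference, the loop in sys_sync() looks roughly like
> 	this (paraphrased from memory, not the exact vfs_syscalls.c
> 	source):
> 
> 		struct mount *mp, *nmp;
> 		int asyncflag;
> 
> 		/* Walk the list of mounted filesystems. */
> 		for (mp = mountlist.cqh_last; mp != (void *)&mountlist;
> 		    mp = nmp) {
> 			nmp = mp->mnt_list.cqe_prev;
> 			/* Skip read-only mounts -- the MNT_RDONLY
> 			 * check discussed below. */
> 			if ((mp->mnt_flag & MNT_RDONLY) == 0 &&
> 			    !vfs_busy(mp)) {
> 				asyncflag = mp->mnt_flag & MNT_ASYNC;
> 				mp->mnt_flag &= ~MNT_ASYNC;
> 				VFS_SYNC(mp, MNT_NOWAIT, p->p_ucred, p);
> 				if (asyncflag)
> 					mp->mnt_flag |= MNT_ASYNC;
> 				vfs_unbusy(mp);
> 			}
> 		}
> 
> 	Nothing in this loop ever looks at a block device that has
> 	no filesystem mounted on it.
> 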
> 	There are two sub-cases here.
> 
> 	Case 1: block devices accessed with write()
> 
> 	In this case, I don't know whether data is flushed or not.
> 	We walk the list of mounted filesystems, flushing data
> 	for all their vnodes.  This percolates down as given above.
> 	But does the VFS_SYNC->VOP_FSYNC->vflushbuf() chain actually
> 	catch the vnode for the block device?  I guess it depends
> 	on whether that device vnode is on the mount-point's list
> 	of vnodes.  If it is, it should, I *think*, get written --
> 	but then I don't understand why filesystems beneath the
> 	root don't get written multiple times.  Someone, please
> 	help me understand this!
> 
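> 	For concreteness, the per-filesystem walk that VFS_SYNC()
> 	ends up doing in, say, ffs_sync() is roughly this (again
> 	from memory, names approximate):
> 
> 		struct vnode *vp, *nvp;
> 
> 	loop:
> 		for (vp = mp->mnt_vnodelist.lh_first; vp != NULL;
> 		    vp = nvp) {
> 			/* Restart if the list changed under us. */
> 			if (vp->v_mount != mp)
> 				goto loop;
> 			nvp = vp->v_mntvnodes.le_next;
> 			/* Skip vnodes with no dirty buffers. */
> 			if (vp->v_dirtyblkhd.lh_first == NULL)
> 				continue;
> 			if (vget(vp, 1))
> 				continue;
> 			(void) VOP_FSYNC(vp, cred, MNT_NOWAIT, p);
> 			vput(vp);
> 		}
> 
> 	So the question above is exactly whether the device vnode
> 	appears on mnt_vnodelist here.
> 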
> 	In any event, if the filesystem the block device's node
> 	lives on is itself mounted read-only, the block device
> 	definitely doesn't get flushed, because of the check for
> 	MNT_RDONLY (vfs_syscalls.c line 520; see the sketch
> 	above).  So we can definitely lose this way.
> 
> 	Case 2: block devices accessed with mmap()
> 
> 	mmap()ed data is flushed by vnode_pager_sync(mp) or by
> 	uvm_vnp_sync(mp).  These walk the list of uvn's (UVM)
> 	or vm_objects (Mach VM), checking the mount point of
> 	each corresponding vnode and flushing all dirty pages
> 	for those which match the given mp (uvm_vnode.c line
> 	1984).  If I'm correct that vp->v_mount for a device
> 	node is in fact the filesystem the device node lives
> 	in (and not NULL or something), then *usually* data
> 	gets synced this way.  However, we still lose for
> 	device nodes that live on read-only filesystems.
> 
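> 	Schematically, the loop is something like this (the real
> 	uvm_vnode.c code is more involved; names approximate):
> 
> 		void
> 		uvm_vnp_sync(mp)
> 			struct mount *mp;
> 		{
> 			struct uvm_vnode *uvn;
> 			struct vnode *vp;
> 
> 			for (uvn = uvn_wlist.lh_first; uvn != NULL;
> 			    uvn = uvn->u_wlist.le_next) {
> 				vp = (struct vnode *)uvn;
> 				/* The comparison at issue: for a
> 				 * device vnode on a read-only
> 				 * filesystem, no mp passed in from
> 				 * sys_sync() ever matches. */
> 				if (vp->v_mount != mp)
> 					continue;
> 				/* ... flush this uvn's dirty pages ... */
> 			}
> 		}
> 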
> 	I'm pretty sure I know how to fix this.  I can just
> 	change the semantics of uvm_vnp_sync()/vnode_pager_sync()
> 	to remove the "mp" argument (and comparison), and move the
> 	call outside the per-mount-point loop.  I don't *think* this
> 	needs to be protected with vfs_busy.
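> 
> 	That is, sys_sync() would start out something like:
> 
> 		/* Proposed: flush every mmap()ed page first,
> 		 * regardless of which filesystem (if any) the
> 		 * vnode belongs to... */
> 		uvm_vnp_sync();		/* "mp" argument removed */
> 
> 		/* ...then sync the mounted filesystems as before. */
> 		for (mp = mountlist.cqh_last; mp != (void *)&mountlist;
> 		    mp = nmp) {
> 			nmp = mp->mnt_list.cqe_prev;
> 			if ((mp->mnt_flag & MNT_RDONLY) == 0 &&
> 			    !vfs_busy(mp)) {
> 				VFS_SYNC(mp, MNT_NOWAIT, p->p_ucred, p);
> 				vfs_unbusy(mp);
> 			}
> 		}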
> 	
> 	If it does, I propose to vfs_busy all filesystems and
> 	call uvm_vnp_sync/vnode_pager_sync with the new
> 	interface, then either:
> 
> 	(a) vfs_unbusy all filesystems, then iterate over them
> 	    vfs_busy-ing, sync-ing, and un-busying as before; or
> 
> 	(b) leave them all vfs_busied, VFS_SYNC them all, then
> 	    vfs_unbusy them all (sketched below).
> 
> 	I'd like suggestions on which approach is better, as
> 	well as whether I need to protect the call to
> 	uvm_vnp_sync/vnode_pager_sync with vfs_busy at all.
> 	(I think I do, since vfs_busy protects against
> 	unmounting while the sync is running.)
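> 
> 	Option (b) would look something like:
> 
> 		struct mount *mp;
> 
> 		/* Busy everything first; holding the busies for the
> 		 * whole operation keeps unmount out while we sync. */
> 		for (mp = mountlist.cqh_first; mp != (void *)&mountlist;
> 		    mp = mp->mnt_list.cqe_next)
> 			(void) vfs_busy(mp);
> 
> 		uvm_vnp_sync();		/* all vnodes, no mp check */
> 
> 		for (mp = mountlist.cqh_first; mp != (void *)&mountlist;
> 		    mp = mp->mnt_list.cqe_next) {
> 			if ((mp->mnt_flag & MNT_RDONLY) == 0)
> 				VFS_SYNC(mp, MNT_NOWAIT, p->p_ucred, p);
> 			vfs_unbusy(mp);
> 		}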
> 
> I'm quite curious about the write() case, and about whether device
> nodes' buffers are or are not flushed when the filesystems they
> live on are flushed.  If they are, I can't see why filesystems below
> the root aren't flushed multiple times: once from the mount list, and
> once when their device's vnode is flushed along with the filesystem
> it lives on.
> 
> I'm hoping someone more familiar with the VFS code and buffer cache
> can remove some of my Deep Confusion here.
>