Subject: kern/4269: msync() doesn't conform to POSIX
To: None <gnats-bugs@gnats.netbsd.org>
From: None <cgd@NetBSD.ORG>
List: netbsd-bugs
Date: 10/13/1997 20:56:46
>Number:         4269
>Category:       kern
>Synopsis:       msync() doesn't conform to POSIX
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people (Kernel Bug People)
>State:          open
>Class:          change-request
>Submitter-Id:   net
>Arrival-Date:   Mon Oct 13 14:05:04 1997
>Last-Modified:
>Originator:     Chris G. Demetriou
>Organization:
Kernel Hackers 'r' Us
>Release:        1.2G, but a few months old.
>Environment:
System: NetBSD brick.demetriou.com 1.2G NetBSD 1.2G (BRICK) #116: Wed Jul 16 14:03:06 PDT 1997 cgd@brick.demetriou.com:/usr/src/sys/arch/i386/compile/BRICK i386


>Description:
	msync() doesn't conform to POSIX.  IEEE Std 1003.1b-1993 specifies
	that msync() is to be invoked:

		msync(void *addr, size_t size, int flags);

	but our msync() only has two arguments.  This causes various
	third party applications which want to use msync() to break.
	(The one that I run into often is lmbench.)

>How-To-Repeat:
	Try to compile lmbench's lat_pagefault, or compile other programs
	which want a POSIX msync().

>Fix:
	Below are some diffs which implement a fix.  I'd suggest that they
	be applied (with the following as suggestions for what else
	to do while they're being applied):

		* vm_swap_12.c should really be renamed vm_12.c.
		* the new msync() syscall number might have to be
		  re-chosen; I didn't know which to use.
		* the syscall tables & headers need to be re-gen'd after
		  this patch is applied.
		* The libc major number probably has to be bumped for this
		  change to work right, or some other backward-compatibility
		  magic needs to be done.

	The documentation update could use a bit of cleanup.

	Some notes about the diffs:

		* the changes to vm_map_clean() should be OK; as far
		  as I can tell, nothing but msync() uses it anyway.
		* the standard and other OSes' manual pages (at least
		  the ones I checked) are unclear as to which combinations
		  of flags are allowed.  The behaviour of this code was
		  decided based on the results of calling the Digital UNIX
		  msync() with a bunch of flags.
		* the "Disallow wrap-wround" test was simply wrong.  There's
		  no reason that a size_t has to be the same size as
		  an int, etc.
		* A few error returns were changed to match the standard,
		  and a new VM internal error code was added so I wouldn't
		  have to overload a different code.
		* Use of MS_ASYNC would cause the process to hang with
		  "vospgw".  The code path it takes in that case looks
		  ... severly broken, and I'm not going to debug it, so
		  i hacked the code so that it always does sync I/O.

	There are a few source files in the tree which use msync().  They
	can be found easily enough with grep.  There are also a few sets of
	binary emulation code which stub out an msync() call for their OSes.
	They might benefit from this code, as well.

Index: sys/compat/common/vm_swap_12.c
===================================================================
RCS file: /cvsroot/src/sys/compat/common/vm_swap_12.c,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 vm_swap_12.c
*** vm_swap_12.c	1997/07/08 00:45:15	1.1.1.1
--- vm_swap_12.c	1997/10/10 23:19:20
***************
*** 37,42 ****
--- 37,43 ----
  #include <sys/mount.h>		/* needed for next include! */
  #include <sys/syscallargs.h>
  
+ #include <sys/mman.h>
  #include <vm/vm_swap.h>
  
  int
***************
*** 54,57 ****
--- 55,76 ----
  	SCARG(&ua, arg) = (void *)SCARG(uap, name);
  	SCARG(&ua, misc) = 0;	/* priority */
  	return (sys_swapctl(p, &ua, retval));
+ }
+ 
+ int
+ compat_12_sys_msync(p, v, retval)
+ 	struct proc *p;
+ 	void *v;
+ 	register_t *retval;
+ {
+ 	struct sys_msync_args ua;
+ 	struct compat_12_sys_msync_args /* {
+ 		syscallarg(caddr_t) addr;
+ 		syscallarg(size_t) len;
+ 	} */ *uap = v;
+ 
+ 	SCARG(&ua, addr) = SCARG(uap, addr);;
+ 	SCARG(&ua, len) = SCARG(uap, len);;
+ 	SCARG(&ua, flags) = MS_SYNC | MS_INVALIDATE;
+ 	return (sys_msync(p, &ua, retval));
  }
Index: sys/kern/syscalls.master
===================================================================
RCS file: /cvsroot/src/sys/kern/syscalls.master,v
retrieving revision 1.1.1.4
diff -c -r1.1.1.4 syscalls.master
*** syscalls.master	1997/07/08 00:46:26	1.1.1.4
--- syscalls.master	1997/10/10 23:21:38
***************
*** 139,145 ****
  63	COMPAT_43	{ int sys_getkerninfo(int op, char *where, int *size, \
  			    int arg); } ogetkerninfo
  64	COMPAT_43	{ int sys_getpagesize(void); } ogetpagesize
! 65	STD		{ int sys_msync(caddr_t addr, size_t len); }
  66	STD		{ int sys_vfork(void); }
  67	OBSOL		vread
  68	OBSOL		vwrite
--- 139,145 ----
  63	COMPAT_43	{ int sys_getkerninfo(int op, char *where, int *size, \
  			    int arg); } ogetkerninfo
  64	COMPAT_43	{ int sys_getpagesize(void); } ogetpagesize
! 65	COMPAT_12	{ int sys_msync(caddr_t addr, size_t len); }
  66	STD		{ int sys_vfork(void); }
  67	OBSOL		vread
  68	OBSOL		vwrite
***************
*** 494,496 ****
--- 494,497 ----
  270	STD		{ int sys_posix_rename(const char *from, \
  			    const char *to); }
  271	STD		{ int sys_swapctl(int cmd, void *arg, int misc); }
+ 272	STD		{ int sys_msync(void *addr, size_t len, int flags); }
Index: sys/sys/mman.h
===================================================================
RCS file: /cvsroot/src/sys/sys/mman.h,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 mman.h
*** mman.h	1997/04/19 01:23:09	1.1.1.1
--- mman.h	1997/10/13 18:47:30
***************
*** 79,84 ****
--- 79,91 ----
  #define	MADV_WILLNEED	3	/* will need these pages */
  #define	MADV_DONTNEED	4	/* dont need these pages */
  
+ /*
+  * Flags to msync
+  */
+ #define	MS_ASYNC	0x01	/* perform asynchronous writes */
+ #define	MS_SYNC		0x02	/* perform synchronous writes */
+ #define	MS_INVALIDATE	0x04	/* invalidate cached data */
+ 
  #ifndef _KERNEL
  
  #include <sys/cdefs.h>
***************
*** 88,94 ****
  caddr_t	mmap __P((caddr_t, size_t, int, int, int, off_t));
  int	mprotect __P((caddr_t, size_t, int));
  int	munmap __P((caddr_t, size_t));
! int	msync __P((caddr_t, size_t));
  int	mlock __P((caddr_t, size_t));
  int	munlock __P((caddr_t, size_t));
  int	madvise __P((caddr_t, size_t, int));
--- 95,101 ----
  caddr_t	mmap __P((caddr_t, size_t, int, int, int, off_t));
  int	mprotect __P((caddr_t, size_t, int));
  int	munmap __P((caddr_t, size_t));
! int	msync __P((void *, size_t, int));
  int	mlock __P((caddr_t, size_t));
  int	munlock __P((caddr_t, size_t));
  int	madvise __P((caddr_t, size_t, int));
Index: sys/vm/vm_map.c
===================================================================
RCS file: /cvsroot/src/sys/vm/vm_map.c,v
retrieving revision 1.1.1.3
diff -c -r1.1.1.3 vm_map.c
*** vm_map.c	1997/07/08 00:48:26	1.1.1.3
--- vm_map.c	1997/10/11 02:24:31
***************
*** 1403,1409 ****
  	}
  
  	/*
! 	 * Make a first pass to check for holes.
  	 */
  	for (current = entry; current->start < end; current = current->next) {
  		if (current->is_sub_map) {
--- 1403,1410 ----
  	}
  
  	/*
! 	 * Make a first pass to check for holes and (if invalidating)
! 	 * wired pages.
  	 */
  	for (current = entry; current->start < end; current = current->next) {
  		if (current->is_sub_map) {
***************
*** 1415,1420 ****
--- 1416,1425 ----
  		     current->end != current->next->start)) {
  			vm_map_unlock_read(map);
  			return(KERN_INVALID_ADDRESS);
+ 		}
+ 		if (current->wired_count) {
+ 			vm_map_unlock_read(map);
+ 			return(KERN_PAGES_LOCKED);
  		}
  	}
  
Index: sys/vm/vm_mmap.c
===================================================================
RCS file: /cvsroot/src/sys/vm/vm_mmap.c,v
retrieving revision 1.1.1.2
diff -c -r1.1.1.2 vm_mmap.c
*** vm_mmap.c	1997/07/08 00:48:27	1.1.1.2
--- vm_mmap.c	1997/10/13 17:20:06
***************
*** 279,301 ****
  	register_t *retval;
  {
  	struct sys_msync_args /* {
! 		syscallarg(caddr_t) addr;
  		syscallarg(size_t) len;
  	} */ *uap = v;
  	vm_offset_t addr;
  	vm_size_t size, pageoff;
  	vm_map_t map;
! 	int rv;
  	boolean_t syncio, invalidate;
  
  	addr = (vm_offset_t)SCARG(uap, addr);
  	size = (vm_size_t)SCARG(uap, len);
  #ifdef DEBUG
  	if (mmapdebug & (MDB_FOLLOW|MDB_SYNC))
! 		printf("msync(%d): addr %lx len %lx\n",
! 		    p->p_pid, addr, size);
  #endif
  
  	/*
  	 * Align the address to a page boundary,
  	 * and adjust the size accordingly.
--- 279,311 ----
  	register_t *retval;
  {
  	struct sys_msync_args /* {
! 		syscallarg(void *) addr;
  		syscallarg(size_t) len;
+ 		syscallarg(int) flags;
  	} */ *uap = v;
  	vm_offset_t addr;
  	vm_size_t size, pageoff;
  	vm_map_t map;
! 	int rv, flags;
  	boolean_t syncio, invalidate;
  
  	addr = (vm_offset_t)SCARG(uap, addr);
  	size = (vm_size_t)SCARG(uap, len);
+ 	flags = SCARG(uap, flags);
  #ifdef DEBUG
  	if (mmapdebug & (MDB_FOLLOW|MDB_SYNC))
! 		printf("msync(%d): addr %lx len %lx flags %x\n",
! 		    p->p_pid, addr, size, flags);
  #endif
  
+ 	/* sanity check flags */
+ 	if ((flags & ~(MS_ASYNC | MS_SYNC | MS_INVALIDATE)) != 0 ||
+ 	    (flags & (MS_ASYNC | MS_SYNC | MS_INVALIDATE)) == 0 ||
+ 	    (flags & (MS_ASYNC | MS_SYNC)) == (MS_ASYNC | MS_SYNC))
+ 		return (EINVAL);
+ 	if ((flags & (MS_ASYNC | MS_SYNC)) == 0)
+ 		flags |= MS_SYNC;
+ 
  	/*
  	 * Align the address to a page boundary,
  	 * and adjust the size accordingly.
***************
*** 306,313 ****
  	size = (vm_size_t) round_page(size);
  
  	/* Disallow wrap-around. */
! 	if (addr + (int)size < addr)
! 		return (EINVAL);
  
  	map = &p->p_vmspace->vm_map;
  	/*
--- 316,323 ----
  	size = (vm_size_t) round_page(size);
  
  	/* Disallow wrap-around. */
! 	if (addr + size < addr)
! 		return (ENOMEM);
  
  	map = &p->p_vmspace->vm_map;
  	/*
***************
*** 325,331 ****
  		rv = vm_map_lookup_entry(map, addr, &entry);
  		vm_map_unlock_read(map);
  		if (rv == FALSE)
! 			return (EINVAL);
  		addr = entry->start;
  		size = entry->end - entry->start;
  	}
--- 335,341 ----
  		rv = vm_map_lookup_entry(map, addr, &entry);
  		vm_map_unlock_read(map);
  		if (rv == FALSE)
! 			return (ENOMEM);
  		addr = entry->start;
  		size = entry->end - entry->start;
  	}
***************
*** 334,350 ****
  		printf("msync: cleaning/flushing address range [%lx-%lx)\n",
  		    addr, addr+size);
  #endif
  	/*
! 	 * Could pass this in as a third flag argument to implement
! 	 * Sun's MS_ASYNC.
  	 */
  	syncio = TRUE;
! 	/*
! 	 * XXX bummer, gotta flush all cached pages to ensure
! 	 * consistency with the file system cache.  Otherwise, we could
! 	 * pass this in to implement Sun's MS_INVALIDATE.
! 	 */
! 	invalidate = TRUE;
  	/*
  	 * Clean the pages and interpret the return value.
  	 */
--- 344,363 ----
  		printf("msync: cleaning/flushing address range [%lx-%lx)\n",
  		    addr, addr+size);
  #endif
+ 
+ #if 0
  	/*
! 	 * XXX Asynchronous msync() causes:
! 	 *	. the process to hang on wchan "vospgw", and
! 	 *	. a "vm_object_page_clean: pager_put error" message to
! 	 *	  be printed by the kernel.
  	 */
+ 	syncio = (flags & MS_SYNC) ? TRUE : FALSE;
+ #else
  	syncio = TRUE;
! #endif
! 	invalidate = (flags & MS_INVALIDATE) ? TRUE : FALSE;
! 
  	/*
  	 * Clean the pages and interpret the return value.
  	 */
***************
*** 353,361 ****
  	case KERN_SUCCESS:
  		break;
  	case KERN_INVALID_ADDRESS:
! 		return (EINVAL);	/* Sun returns ENOMEM? */
  	case KERN_FAILURE:
  		return (EIO);
  	default:
  		return (EINVAL);
  	}
--- 366,376 ----
  	case KERN_SUCCESS:
  		break;
  	case KERN_INVALID_ADDRESS:
! 		return (ENOMEM);
  	case KERN_FAILURE:
  		return (EIO);
+ 	case KERN_PAGES_LOCKED:
+ 		return (EBUSY);
  	default:
  		return (EINVAL);
  	}
Index: sys/vm/vm_param.h
===================================================================
RCS file: /cvsroot/src/sys/vm/vm_param.h,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 vm_param.h
*** vm_param.h	1997/04/19 01:16:19	1.1.1.1
--- vm_param.h	1997/10/10 23:15:14
***************
*** 124,129 ****
--- 124,130 ----
  #define	KERN_RESOURCE_SHORTAGE	6
  #define	KERN_NOT_RECEIVER	7
  #define	KERN_NO_ACCESS		8
+ #define	KERN_PAGES_LOCKED	9
  
  #ifndef ASSEMBLER
  /*
Index: lib/libc/sys/msync.2
===================================================================
RCS file: /cvsroot/src/lib/libc/sys/msync.2,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 msync.2
*** msync.2	1997/04/27 16:56:24	1.1.1.1
--- msync.2	1997/10/11 04:12:08
***************
*** 33,55 ****
  .\"
  .\"	@(#)msync.2	8.1 (Berkeley) 6/9/93
  .\"
! .Dd June 9, 1993
  .Dt MSYNC 2
! .Os
  .Sh NAME
  .Nm msync
  .Nd synchronize a mapped region
  .Sh SYNOPSIS
- .Fd #include <sys/types.h>
  .Fd #include <sys/mman.h>
  .Ft int
! .Fn msync "caddr_t addr" "size_t len"
  .Sh DESCRIPTION
  The
  .Fn msync
! system call
! writes any modified pages back to the filesystem and updates
! the file modification time.
  If
  .Fa len
  is 0, all modified pages within the region containing
--- 33,56 ----
  .\"
  .\"	@(#)msync.2	8.1 (Berkeley) 6/9/93
  .\"
! .Dd October 10, 1997 
  .Dt MSYNC 2
! .Os NetBSD 1.3
  .Sh NAME
  .Nm msync
  .Nd synchronize a mapped region
  .Sh SYNOPSIS
  .Fd #include <sys/mman.h>
  .Ft int
! .Fn msync "void *addr" "size_t len" "int flags"
  .Sh DESCRIPTION
  The
  .Fn msync
! system call writes all pages with shared modifications
! in the specified
! region of the process's address space back to permanent
! storage, and, if requested, invalidates cached data mapped
! in the region.
  If
  .Fa len
  is 0, all modified pages within the region containing
***************
*** 57,95 ****
  will be flushed;
  if
  .Fa len
! is non-zero, only the pages containing
  .Fa addr
  and
  .Fa len
! succeeding locations will be examined.
  Any required synchronization of memory caches
  will also take place at this time.
  Filesystem operations on a file that is mapped for shared modifications
  are unpredictable except after an
  .Fn msync .
  .Sh ERRORS
  The following errors may be reported:
  .Bl -tag -width Er
! .It Bq Er EINVAL
  The
! .Fa addr
! parameter was not page aligned.
  .It Bq Er EINVAL
! The
! .Fa addr
! parameter did not specify an address part of a mapped region.
  .It Bq Er EINVAL
  The
! .Fa len
! parameter was negative.
  .It Bq Er EIO
! An I/O error occured while writing to the file system.
  .Sh SEE ALSO
! .Xr madvise 2 ,
! .Xr munmap 2 ,
! .Xr mprotect 2 ,
! .Xr mincore 2
  .Sh HISTORY
  The
  .Fn msync
! function first appeared in 4.4BSD.
--- 58,123 ----
  will be flushed;
  if
  .Fa len
! is non-zero, only modified pages containing
  .Fa addr
  and
  .Fa len
! succeeding locations will be flushed.
  Any required synchronization of memory caches
  will also take place at this time.
  Filesystem operations on a file that is mapped for shared modifications
  are unpredictable except after an
  .Fn msync .
+ .Pp
+ The
+ .Fa flags
+ argument is formed by
+ .Em or Ns 'ing
+ the following values
+ .Pp
+ .Bd -literal -offset indent -compact
+ MS_ASYNC	Perform asynchronous writes.
+ MS_SYNC		Perform synchronous writes.
+ MS_INVALIDATE	Invalidate cached data after writing.
+ .Ed
  .Sh ERRORS
  The following errors may be reported:
  .Bl -tag -width Er
! .It Bq Er EBUSY
  The
! .Dv MS_INVALIDATE
! flag was specified and a portion of the specified region
! was locked with
! .Xr mlock 2 .
  .It Bq Er EINVAL
! The specified
! .Fa flags
! argument was invalid.
  .It Bq Er EINVAL
  The
! .Fa addr
! parameter was not page aligned.
  .It Bq Er EIO
! An I/O error occured while writing.
! .It Bq Er ENOMEM
! Addresses in the specified region are outside the range allowed
! for the address space of the process, or specify one or more pages
! which are unmapped.
  .Sh SEE ALSO
! .Xr mlock 2 ,
! .Xr mmap 2 ,
! .Xr munlock 2
! .Sh BUGS
! Writes are currently done synchronously even if the
! .Dv MS_ASYNC
! flag is specified.
  .Sh HISTORY
  The
  .Fn msync
! function first appeared in
! .Bx 4.4 .
! It was modified to conform to
! .St -p1003.1b-93
! in
! .Nx 1.3 .
! .P
>Audit-Trail:
>Unformatted: