Subject: Re: VM hangs with latest bits?
To: Bill Sommerfeld <sommerfeld@orchard.medford.ma.us>
From: Jason Thorpe <thorpej@nas.nasa.gov>
List: current-users
Date: 02/24/1997 14:15:28
On Mon, 24 Feb 1997 16:59:41 GMT 
 Bill Sommerfeld <sommerfeld@orchard.medford.ma.us> wrote:

 > Anyone else seeing severe VM system hangs with today's bits?

"Yes."  It was killing my SS2.

 > symptom is: lockup which starts when starting memory hogs (larger X
 > clients, emacs, ..).  A bunch of processes appear to be hung in

... ld on a debugging kernel.... :-)

 > .. which appears to be "wait for pageout daemon to clean pages" logic.

Yes.  The problem was a deadlock between the pagedaemon and the new,
more aggressive vm_object_collapse() code.  Essentially, a pageout
could trigger a collapse, but the collapse code would see that paging
was in progress and sleep waiting for it to finish.  Since the
pagedaemon itself can be the caller of vm_object_collapse(), it ends
up waiting on paging that it is supposed to complete, hence the
deadlock.  I've committed the following patch from Charles Hannum,
which has "fixed" the problem by making the collapse give up rather
than sleep.
(XXX - the whole situation that causes this deadlock really needs to
be avoided in the first place, but this is a "good enough" stopgap.)

Jason R. Thorpe                                       thorpej@nas.nasa.gov
NASA Ames Research Center                               Home: 408.866.1912
NAS: M/S 258-6                                          Work: 415.604.0935
Moffett Field, CA 94035                                Pager: 415.428.6939

diff -rc2 t/vm_object.c ./vm_object.c
*** t/vm_object.c	Sat Feb 22 17:39:46 1997
--- ./vm_object.c	Sun Feb 23 15:22:25 1997
***************
*** 1189,1197 ****
--- 1189,1206 ----
  	 *    we're deleting.  We'll never notice this case, because the
  	 *    backing object we're deleting won't have the page.
+ 	 *
+ 	 * XXXXX FIXME
+ 	 * Because pagedaemon can call vm_object_collapse(), we must *not*
+ 	 * sleep waiting for pages.
  	 */
  
  	vm_object_unlock(object);
  RetryRename:
+ #if 0 /* XXXXX FIXME */
  	vm_object_paging_wait(backing_object);
+ #else
+ 	if (vm_object_paging(backing_object))
+ 		goto fail;
+ #endif
  	/*
  	 * While we were asleep, the parent object might have been deleted.  If
***************
*** 1313,1320 ****
--- 1322,1333 ----
  			    paged_offset);
  			if (backing_page == NULL) {
+ #if 0 /* XXXXX FIXME */
  				vm_object_unlock(backing_object);
  				VM_WAIT;
  				vm_object_lock(backing_object);
  				goto RetryRename;
+ #else
+ 				goto fail;
+ #endif
  			}
  
***************
*** 1341,1344 ****
--- 1354,1363 ----
  			}
  
+ #ifdef DIAGNOSTIC
+ 			if (rv != VM_PAGER_OK)
+ 				panic("vm_object_overlay: pager returned %d",
+ 				    rv);
+ #endif
+ 
  			/*
  			 * The pager might have moved the page while we
***************
*** 1434,1439 ****
  	 */
  	if (vm_object_paging(backing_object) ||
! 	    backing_object->pager != NULL)
  		goto fail;
  
  	/*
--- 1453,1460 ----
  	 */
  	if (vm_object_paging(backing_object) ||
! 	    backing_object->pager != NULL) {
! 		vm_object_unlock(object);
  		goto fail;
+ 	}
  
  	/*
***************
*** 1465,1468 ****
--- 1486,1490 ----
  			 * Page still needed.  Can't go any further.
  			 */
+ 			vm_object_unlock(object);
  			goto fail;
  		}
***************
*** 1733,1740 ****
  		return;
  
! 	iprintf(pr, "Object 0x%lx: size=0x%lx, res=%d, ref=%d, ",
  		(long) object, (long) object->size,
! 		object->resident_page_count, object->ref_count);
! 	(*pr)("pager=0x%lx+0x%lx, shadow=(0x%lx)+0x%lx\n",
  	       (long) object->pager, (long) object->paging_offset,
  	       (long) object->shadow, (long) object->shadow_offset);
--- 1755,1764 ----
  		return;
  
! 	iprintf(pr, "Object 0x%lx: size=0x%lx, res=%d, ref=%d, flags=0x%x, ",
  		(long) object, (long) object->size,
! 		object->resident_page_count, object->ref_count,
! 		object->flags);
! 	(*pr)("pip=%d, pager=0x%lx+0x%lx, shadow=(0x%lx)+0x%lx\n",
! 	       object->paging_in_progress,
  	       (long) object->pager, (long) object->paging_offset,
  	       (long) object->shadow, (long) object->shadow_offset);