Subject: SMP improvements for pmap
To: None <port-i386@netbsd.org>
From: Stephan Uphoff <ups@stups.com>
List: port-i386
Date: 05/21/2003 19:44:54
This is a multipart MIME message.

--==_Exmh_-20711097060
Content-Type: text/plain; charset=us-ascii


I finally had the time to take another look at the i386 pmap code.

The attached patch seems to reduce the number of TLB shootdown 
interrupts substantially, and it fixes some SMP race conditions.

For details please see the patch description below.

Any comments, benchmarks or suggestions would be greatly appreciated. 

	Stephan


---------------------------------------------------------------------

SMP Bug Fixes:
--------------
pmap_do_remove(), pmap_page_remove(): Delay putting PTPs on the free
page list until it is safe to do so (i.e. after the TLB shootdown).
This is also needed for uniprocessor CPUs with opportunistic TLB loads.
(For more information, see the February mail thread "i386 pmap bug" on
 port-i386 and the note on opportunistic TLB loads below.)
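
Purely to illustrate the idea, here is a stand-alone sketch of the
deferred-free pattern (not the patch itself; unmap_next_empty_ptp(),
shoot_tlb() and free_ptp() are made-up placeholders for the real
pmap/uvm calls):

#include <stddef.h>
#include <sys/queue.h>

struct ptp;				/* stand-in for struct vm_page */
TAILQ_HEAD(ptp_list, ptp);
struct ptp {
	TAILQ_ENTRY(ptp) listq;		/* list linkage, as (mis)used in the patch */
};

/* Hypothetical placeholders for the real pmap/uvm operations. */
struct ptp	*unmap_next_empty_ptp(void);	/* NULL when nothing is left */
void		 shoot_tlb(void);		/* shoot down stale translations */
void		 free_ptp(struct ptp *);	/* hand the page back */

void
remove_and_free(void)
{
	struct ptp_list empty_ptps;
	struct ptp *ptp;

	TAILQ_INIT(&empty_ptps);

	/* Phase 1: tear down PTEs; queue PTPs that became empty. */
	while ((ptp = unmap_next_empty_ptp()) != NULL)
		TAILQ_INSERT_TAIL(&empty_ptps, ptp, listq);

	/* Phase 2: no CPU may keep a stale translation into those pages. */
	shoot_tlb();

	/* Phase 3: only now is it safe to recycle the page table pages. */
	while ((ptp = TAILQ_FIRST(&empty_ptps)) != NULL) {
		TAILQ_REMOVE(&empty_ptps, ptp, listq);
		free_ptp(ptp);
	}
}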


pmap_enter(): Eliminate a race condition that could cause lost PG_M and PG_U
              bits for a managed page.
              (The attributes were read before the entry was zapped, so any
               changes made between reading and zapping were lost.)
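
Again only an illustration, not the patch: a stand-alone sketch of the
racy read-then-zap pattern versus taking the attributes from the value
returned by the atomic exchange. pte_testset() is a made-up stand-in
for x86_atomic_testset_ul(), and the PG_* values follow the usual i386
bit assignments.

#include <stdint.h>

#define PG_U	0x020u			/* accessed ("used") bit */
#define PG_M	0x040u			/* modified ("dirty") bit */

typedef uint32_t pt_entry_t;

/* Atomically exchange a PTE and return the old value
 * (stand-in for x86_atomic_testset_ul()). */
static pt_entry_t
pte_testset(volatile pt_entry_t *ptep, pt_entry_t npte)
{
	return __sync_lock_test_and_set(ptep, npte);
}

/* Racy: another CPU can set PG_U/PG_M between the read and the zap,
 * and those updates are silently lost. */
static pt_entry_t
save_attrs_racy(volatile pt_entry_t *ptep, pt_entry_t npte)
{
	pt_entry_t opte = *ptep;		/* read first ... */

	(void)pte_testset(ptep, npte);		/* ... then zap: lost-update window */
	return opte & (PG_U | PG_M);
}

/* Safe: take the attributes from the very value the exchange removed,
 * so no concurrent PG_U/PG_M update can slip through. */
static pt_entry_t
save_attrs_safe(volatile pt_entry_t *ptep, pt_entry_t npte)
{
	pt_entry_t opte = pte_testset(ptep, npte);	/* zap and read at once */

	return opte & (PG_U | PG_M);
}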


SMP optimizations:
------------------

Avoid TLB shootdown operations whenever possible.

This is especially beneficial for disk I/O to and from UVM pages
that were accessed by DMA rather than by the CPU, so no TLB
shootdown is needed.

How can we avoid a TLB shootdown?

The Intel manuals are extremely vague when it comes to lazy TLB
shootdown (no OS can afford strict TLB shootdown).

However, based on all available material (manuals, white papers,
Intel errata descriptions, existing lazy TLB shootdown implementations,
US Patent 5,680,565, etc.), I believe we can assume the following
simplified model for PTE handling:

------------------------------------------------------------------------
(a) If the CPU needs to set the PG_U bit in a PTE to make it usable
    for address translation, it does so by ATOMICALLY checking the PG_V
    bit of the PTE and, if PG_V is set, setting the PG_U bit.

(b) If the CPU needs to set the PG_M bit in a PTE to make it usable
    for write access, it does so by ATOMICALLY checking
    both the PG_V and PG_RW bits of the PTE and, if both are set,
    setting the PG_M bit.

(c) The CPU only uses or caches an entry for address translation
    if it has observed both the PG_U and PG_V bits set at the same time.

(d) For write access, the CPU only uses or caches an entry for
    address translation if it has observed PG_M, PG_RW, PG_U
    and PG_V all set at the same time.

(e) The CPU might opportunistically load valid cache entries as described
    in (c, d) (see also Intel's US Patent 5,680,565).

--------------------------------------------------------------------------

It follows that there is no need to flush the TLB entry after an
old PTE has been atomically exchanged with a new PTE if (see the
sketch below):

1) the CPU could NOT have used the old PTE for caching as described
   in (c, d); or
2) the old PTE and the new valid PTE contain the same page frame,
   AND either the new PTE allows write access or the old PTE was
   unusable for write access as described in (d).
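
To make the rules concrete, here is a minimal stand-alone sketch of a
predicate that decides whether a shootdown is needed after an old PTE
has been atomically replaced by a new one. It is not part of the patch;
pte_needs_shootdown() is a made-up helper, and the PG_* / PG_FRAME
values follow the usual i386 definitions.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t pt_entry_t;

#define PG_V		0x001u		/* valid */
#define PG_RW		0x002u		/* read/write */
#define PG_U		0x020u		/* accessed ("used") */
#define PG_M		0x040u		/* modified ("dirty") */
#define PG_FRAME	0xfffff000u	/* page frame mask */

/* Hypothetical helper: return true if the TLB entry for this va must
 * be shot down after opte was atomically replaced by npte. */
static bool
pte_needs_shootdown(pt_entry_t opte, pt_entry_t npte)
{
	/*
	 * Rule 1: the old PTE can only have been cached if the CPU
	 * observed PG_V and PG_U set at the same time (model (c)).
	 */
	if ((opte & (PG_V | PG_U)) != (PG_V | PG_U))
		return false;

	/*
	 * Rule 2: same page frame, and either the new PTE still allows
	 * writes or the old one could never have been cached writable
	 * (model (d): that needs PG_V, PG_RW, PG_U and PG_M all set).
	 */
	if ((npte & PG_V) != 0 &&
	    (opte & PG_FRAME) == (npte & PG_FRAME) &&
	    ((npte & PG_RW) != 0 ||
	     (opte & (PG_RW | PG_M)) != (PG_RW | PG_M)))
		return false;

	/* Otherwise a stale (or stale-writable) translation may be cached. */
	return true;
}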

Another optimization is to always clear both PG_U and PG_M bits in 
a PTE if we are required to schedule a TLB shootdown in pmap_clear_attrs().
This might save a TLB shootdown later.
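
A rough stand-alone sketch of that idea (again hypothetical and not
part of the patch; __sync_fetch_and_and() is a compiler builtin used
here as a stand-in for the atomic clear-and-fetch that the patch gets
from x86_atomic_testset_ul()):

#include <stdint.h>

#define PG_U	0x020u			/* accessed ("used") bit */
#define PG_M	0x040u			/* modified ("dirty") bit */

typedef uint32_t pt_entry_t;

/*
 * Once a shootdown is unavoidable, clear PG_U and PG_M together.
 * A later pmap_clear_attrs() call for the other bit then finds a PTE
 * the CPU cannot have cached (no PG_U observed), so it can skip its
 * own shootdown.  Returns the attribute bits that were set.
 */
static pt_entry_t
pte_clear_attrs_wide(volatile pt_entry_t *ptep)
{
	/* Atomically clear both bits; the fetch returns the old PTE. */
	pt_entry_t opte = __sync_fetch_and_and(ptep, ~(pt_entry_t)(PG_U | PG_M));

	return opte & (PG_U | PG_M);
}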


Open Issues:
-------------
pmap_enter() does not replace the old mapping atomically with the new mapping
             if the old mapping points to a managed page.
             Other CPUs might observe the va as unmapped (pte == 0) before
             pmap_enter() sets the new mapping.

             I don't think this is a problem.
             If it is, the pve handling needs to be changed.

pmap_do_remove(), pmap_page_remove():
             Queuing the PTP pages for later deletion is ugly, because it
             misuses the listq fields, and unnecessarily complex, because
             it uses the TAILQ macros.
	     


--==_Exmh_-20711097060
Content-Type: text/plain; name="pmap.c.patch"; charset=us-ascii
Content-Description: pmap.c.patch
Content-Disposition: attachment; filename="pmap.c.patch"


diff -r1.154 pmap.c
741c741
< 	if (pmap_valid_entry(opte)) {
---
> 	 if ((opte & (PG_V | PG_U)) == (PG_V | PG_U))  {
751c751
< 	}
---
> 		}
789,790c789,792
< #endif
< 		pmap_tlb_shootdown(pmap_kernel(), va, opte, &cpumask);
---
> #endif
> 		if ((opte & (PG_V | PG_U)) == (PG_V | PG_U)) {
> 			pmap_tlb_shootdown(pmap_kernel(), va, opte, &cpumask);
> 		}
2176,2178c2178,2181
< 		pmap_tlb_shootdown(pmap, startva, opte, cpumaskp);
< 
< 		if (ptp)
---
> 		if (opte & PG_U)
> 			pmap_tlb_shootdown(pmap, startva, opte, cpumaskp);
> 
> 		if (ptp) {
2179a2183,2186
> 			/* Make sure that the PDE is flushed */
> 			if ((ptp->wire_count <= 1) && !(opte & PG_U))
> 				pmap_tlb_shootdown(pmap, startva, opte, cpumaskp);
> 		}
2258,2259c2265,2266
< 	if (ptp)
< 		ptp->wire_count--;		/* dropping a PTE */
---
> 	if (opte & PG_U)
> 		pmap_tlb_shootdown(pmap, va, opte, cpumaskp);
2261c2268,2272
< 	pmap_tlb_shootdown(pmap, va, opte, cpumaskp);
---
> 	if (ptp) {
> 		ptp->wire_count--;		/* dropping a PTE */
> 		/* Make sure that the PDE is flushed */
> 		if ((ptp->wire_count <= 1) && !(opte & PG_U))
> 			pmap_tlb_shootdown(pmap, va, opte, cpumaskp);
2262a2274
> 	}
2327c2339,2340
< 
---
> 	TAILQ_HEAD(,vm_page) empty_ptps;
> 
2331a2345,2346
> 	TAILQ_INIT(&empty_ptps);
> 
2403c2418,2420
< 				uvm_pagefree(ptp);
---
> 				uvm_pagerealloc(ptp,NULL,0);
> 				TAILQ_INSERT_TAIL(&empty_ptps,ptp,listq);
> 
2408c2425,2431
< 		PMAP_MAP_TO_HEAD_UNLOCK();
---
> 		PMAP_MAP_TO_HEAD_UNLOCK();
> 
> 		/* Now we can free unused ptps */
> 		TAILQ_FOREACH(ptp,&empty_ptps,listq )
> 			{
> 				uvm_pagefree(ptp);
> 			}
2493c2516,2519
< 			uvm_pagefree(ptp);
---
> 
> 			/* Postpone free to shootdown */
> 			uvm_pagerealloc(ptp,NULL,0);
> 			TAILQ_INSERT_TAIL(&empty_ptps,ptp,listq);
2499a2526,2531
> 
> 	/* Now we can free unused ptps */
> 	TAILQ_FOREACH(ptp,&empty_ptps,listq )
> 	  {
> 	    uvm_pagefree(ptp);
> 	  }
2517a2550,2551
> 	TAILQ_HEAD(,vm_page) empty_ptps;
> 	struct vm_page *ptp;
2528a2563,2564
> 
> 
2529a2566,2567
> 
> 	TAILQ_INIT(&empty_ptps);
2561a2600
> 
2564c2603,2605
< 		pmap_tlb_shootdown(pve->pv_pmap, pve->pv_va, opte, &cpumask);
---
> 		/* Shootdown only if referenced */
> 		if (opte & PG_U)
> 		  pmap_tlb_shootdown(pve->pv_pmap, pve->pv_va, opte, &cpumask);
2573a2615,2618
> 				/* Do we have to shootdown the page just to get the pte out of the TLB ?*/
> 				if(!(opte & PG_U))
> 					pmap_tlb_shootdown(pve->pv_pmap, pve->pv_va, opte, &cpumask);
> 
2575,2576c2620,2621
< 				    &pve->pv_pmap->pm_pdir[pdei(pve->pv_va)],
< 				    0);
---
> 					&pve->pv_pmap->pm_pdir[pdei(pve->pv_va)],
> 					0);
2595c2640,2644
< 				uvm_pagefree(pve->pv_ptp);
---
> 
> 				/* Free only after the shootdown */
> 				uvm_pagerealloc(pve->pv_ptp,NULL,0);
> 				TAILQ_INSERT_TAIL(&empty_ptps,pve->pv_ptp,listq);
> 
2607a2657,2662
> 
> 	/* Now we can free unused ptps */
> 	TAILQ_FOREACH(ptp,&empty_ptps,listq )
> 	  {
> 	    uvm_pagefree(ptp);
> 	  }
2695c2750,2751
< 	pt_entry_t *ptes, opte;
---
> 	volatile pt_entry_t *ptes;
> 	pt_entry_t opte;
2724a2781,2806
> 			/* We need to do something */
> 			if (clearbits == PG_RW) {
> 
> 				result |= PG_RW;
> 
> 				/* On write protect we might not need to flush
> 				 *the TLB
> 				 */
> 
> 				/* First zap the RW bit! */
> 				x86_atomic_clearbits_l(&ptes[x86_btop(pve->pv_va)],PG_RW) ;
> 				opte = ptes[x86_btop(pve->pv_va)];
> 
> 				/* Then test if it is not cached as RW the TLB */
> 				if (!(opte & PG_M))
> 					goto no_tlb_shootdown;
> 
> 
> 			}
> 
> 			/* Since we need a shootdown me might as well
> 			 * always clear PG_U AND PG_M
> 			 */
> 
> 
>        		       	opte = x86_atomic_testset_ul(&ptes[x86_btop(pve->pv_va)],(opte & ~(PG_U | PG_M)));   /* zap! */
2726,2727c2808,2809
< 			x86_atomic_clearbits_l(&ptes[x86_btop(pve->pv_va)],
< 			    (opte & clearbits));	/* zap! */
---
> 			*myattrs |= (opte & ~(clearbits));
> 
2729c2811,2813
< 			    &cpumask);
---
> 					   &cpumask);
> 		no_tlb_shootdown:
> 
2740a2825
> 
2772c2857,2858
< 	pt_entry_t *ptes, *spte, *epte;
---
>         pt_entry_t *ptes, *epte;
>         volatile pt_entry_t  *spte;
2817,2818c2903,2906
< 				pmap_tlb_shootdown(pmap,
< 				    x86_ptob(spte - ptes), *spte, &cpumask);
---
> 				if(*spte & PG_M)
> 				  pmap_tlb_shootdown(pmap,
> 						     x86_ptob(spte - ptes)
> 						     , *spte, &cpumask);
2856c2944
< 			ptes[x86_btop(va)] &= ~PG_W;
---
> 		  x86_atomic_clearbits_l(&ptes[x86_btop(va)], PG_W);
2925d3012
< 	int ptpdelta, wireddelta, resdelta;
2940a3028,3039
> 
> 	npte = pa | protection_codes[prot] | PG_V;
> 
> 	if (wired)
> 	        npte |= PG_W;
> 
> 	if (va < VM_MAXUSER_ADDRESS)
> 		npte |= PG_u;
> 	else if (va < VM_MAX_ADDRESS)
> 		npte |= (PG_u | PG_RW);	/* XXXCDC: no longer needed? */
> 	if (pmap == pmap_kernel())
> 		npte |= pmap_pg_g;
2945,2947d3043
< 	/*
< 	 * map in ptes and get a pointer to our PTP (unless we are the kernel)
< 	 */
2961a3058,3062
> 	/* Get first view on old PTE
> 	 * on SMP the PTE might gain PG_U and PG_M flags
> 	 * before we zap it later
> 	 */
> 
2970c3071
< 		/*
---
> 	         /*
2975,2983c3076,3077
< 
< 		resdelta = 0;
< 		if (wired && (opte & PG_W) == 0)
< 			wireddelta = 1;
< 		else if (!wired && (opte & PG_W) != 0)
< 			wireddelta = -1;
< 		else
< 			wireddelta = 0;
< 		ptpdelta = 0;
---
> 		pmap->pm_stats.wired_count += ( (npte & PG_W) ? 1 : 0
> 						- (opte & PG_W) ? 1 : 0);
2990a3085,3102
> 
> 	                npte |= (opte & PG_PVLIST);
> 
> 			opte = x86_atomic_testset_ul(&ptes[x86_btop(va)],npte);   /* zap! */
> 
> 			/* Any change in the protection level that the CPU
> 			 * should know about ?
> 			 */
> 
> 			if ( (npte & PG_RW)
> 			     || ((opte & (PG_M | PG_RW)) != (PG_M | PG_RW))) {
> 
> 				/* No need to flush the TLB */
> 				/* Just add old  PG_M, ... flags in new entry */
> 				x86_atomic_setbits_l(&ptes[x86_btop(va)],opte & (PG_M | PG_U));
> 
> 				goto out_ok;
> 			}
2991a3104,3105
> 
> 			/* Might be cached in the TLB as beeing writeable */
3001a3116,3117
> 
> 
3006,3007c3122
< 			} else {
< 				pvh = NULL;	/* ensure !PG_PVLIST */
---
> 
3009c3124,3126
< 			goto enter_now;
---
> 			goto shootdown_now;
> 
> 
3015a3133
> 
3028a3147,3149
> 
> 			opte = x86_atomic_testset_ul(&ptes[x86_btop(va)],0);   /* zap! */
> 
3038,3047c3159,3164
< 		pve = NULL;
< 		resdelta = 1;
< 		if (wired)
< 			wireddelta = 1;
< 		else
< 			wireddelta = 0;
< 		if (ptp)
< 			ptpdelta = 1;
< 		else
< 			ptpdelta = 0;
---
> 	  pve = NULL;
> 	  pmap->pm_stats.resident_count++ ;
> 	  if (wired)
> 	    pmap->pm_stats.wired_count++;
> 	  if (ptp)
> 	    ptp->wire_count++;
3049a3167,3168
> 
> 
3058a3178,3180
> 		/* This is a managed page */
> 		npte |= PG_PVLIST;
> 
3063a3186,3203
> 					if (!pmap_valid_entry(opte))
> 					{
> 						/* Repair statistics ! */
> 
> 						if(wired)
> 							pmap->pm_stats.wired_count--;
> 
> 
> 						pmap->pm_stats.resident_count--;
> 						if (ptp)
> 							ptp->wire_count--;
> 					}
> 					else
> 					{
> 						pmap->pm_stats.wired_count -= ( (npte & PG_W) ? 1 : 0
> 										- (opte & PG_W) ? 1 : 0);
> 					}
> 
3068a3209,3212
> 			opte = x86_atomic_testset_ul(&ptes[x86_btop(va)],npte);   /* zap! */
> 		} else {
> 		  /* keep opte */
> 			x86_atomic_testset_ul(&ptes[x86_btop(va)],npte);   /* zap! */
3069a3214
> 
3075d3219
< 		pvh = NULL;		/* ensure !PG_PVLIST */
3076a3221,3222
> 		  {
> 
3077a3224,3229
> 			x86_atomic_testset_ul(&ptes[x86_btop(va)],npte);   /* zap! */
> 		  }
> 		else
> 		  {
> 			  opte = x86_atomic_testset_ul(&ptes[x86_btop(va)],npte);   /* zap! */
> 		  }
3080,3101d3231
< enter_now:
< 	/*
< 	 * at this point pvh is !NULL if we want the PG_PVLIST bit set
< 	 */
< 
< 	pmap->pm_stats.resident_count += resdelta;
< 	pmap->pm_stats.wired_count += wireddelta;
< 	if (ptp)
< 		ptp->wire_count += ptpdelta;
< 	npte = pa | protection_codes[prot] | PG_V;
< 	if (pvh)
< 		npte |= PG_PVLIST;
< 	if (wired)
< 		npte |= PG_W;
< 	if (va < VM_MAXUSER_ADDRESS)
< 		npte |= PG_u;
< 	else if (va < VM_MAX_ADDRESS)
< 		npte |= (PG_u | PG_RW);	/* XXXCDC: no longer needed? */
< 	if (pmap == pmap_kernel())
< 		npte |= pmap_pg_g;
< 
< 	ptes[x86_btop(va)] = npte;		/* zap! */
3103,3107c3233,3242
< 	/*
< 	 * If we changed anything other than modified/used bits,
< 	 * flush the TLB.  (is this overkill?)
< 	 */
< 	if ((opte & ~(PG_M|PG_U)) != npte) {
---
> 	/* Update page attributes if needed */
> 	if( (opte & (PG_V | PG_U)) !=  (PG_V | PG_U))
> 	  {
> 	    /* Not in TLB */
> 	    goto out_ok;
> 	  }
> 
>  shootdown_now:
> 
> 	{
3109c3244
< 		int32_t cpumask = 0;
---
> 	  int32_t cpumask = 0;
3111,3112c3246,3247
< 		pmap_tlb_shootdown(pmap, va, opte, &cpumask);
< 		pmap_tlb_shootnow(cpumask);
---
> 	  pmap_tlb_shootdown(pmap, va, opte, &cpumask);
> 	  pmap_tlb_shootnow(cpumask);
3114,3116c3249,3251
< 		/* Don't bother deferring in the single CPU case. */
< 		if (pmap_is_curpmap(pmap))
< 			pmap_update_pg(va);
---
> 	  /* Don't bother deferring in the single CPU case. */
> 	  if (pmap_is_curpmap(pmap))
> 	    pmap_update_pg(va);
3120d3254
< 	error = 0;
3122c3256,3259
< out:
---
>  out_ok:
> 
> 	error = 0;
>  out:

--==_Exmh_-20711097060--