NetBSD-Bugs archive


Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)



The following reply was made to PR kern/57558; it has been noted by GNATS.

From: Frank Kardel <kardel%netbsd.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: 
Subject: Re: kern/57558: pgdaemon 100% busy - no scanning (ZFS case)
Date: Thu, 3 Aug 2023 16:11:51 +0200

 
 Sure
 
 Setup:
      - all userlands NetBSD-10.0_BETA
      - NetBSD 10.0_BETA (2023-07-26) (-current should also work) XEN3_DOM0
 (pagedaemon patched, see pd.diff attachment)
      - xen-4.15.1
      - NetBSD 10.0_BETA GENERIC as DOMU
      - on the DOM0 a ZFS file system providing a file backing the FFS file
 system in the DOMU (a sample xl disk line follows this list)
      - the DOMU has a PostgreSQL 14.8 installation
      - the test case is loading a sizeable database (~200 GB) into the
 PostgreSQL DB.
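 
 For reference, in such a setup the DOMU disk is just a file on a ZFS dataset
 of the DOM0, handed to the guest through the usual xl disk specification. The
 config name, path and vdev below are made-up placeholders, not taken from the PR:
 
    # excerpt from a hypothetical xl guest config, e.g. pgsql-domu.cfg
    disk = [ 'target=/tank/domu/pgsql-root.img,format=raw,vdev=xvda,access=rw' ]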
 
 This seems complicated to set up (but I am preparing this kind of VM for
 our purposes).
 Going by the errors detected it should also be possible to reproduce it
 like this (not tested; a command sketch follows the list):
      - create a ZFS file system on a plain GENERIC system
      - create a file system file in ZFS
      - vnconfig vndX <path to the file system file>
      - disklabel vndX
      - newfs vndXa
      - mount /dev/vndXa /mnt
      - do lots of fs traffic on the mounted fs: writing, deleting, rewriting
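 
 A minimal sketch of those untested steps as commands; the dataset path, image
 size and the vnd0 unit are placeholder assumptions:
 
    # on the GENERIC system, with /tank being some mounted ZFS dataset
    dd if=/dev/zero of=/tank/ffs.img bs=1m count=32768  # ~32 GB backing file
    vnconfig vnd0 /tank/ffs.img                         # attach it to vnd0
    disklabel -e -I vnd0                                # make sure there is an 'a' partition
    newfs /dev/rvnd0a                                   # create the FFS file system
    mount /dev/vnd0a /mnt
    # then generate lots of traffic under /mnt: write, delete and rewrite files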
 
 Part 1 - current situation:
 
 Use
 sdt:::arc-available_memory
 {
          printf("mem = %d, reason = %d", arg0, arg1);
 }
 
 to track how much memory ZFS thinks it has available - positive values mean
 there is enough memory, negative values ask the ZFS ARC to free that much memory.
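 
 Saved as zfsmem.d (the name used in the dtrace output further below), the
 probe can be run with:
 
    dtrace -s zfsmem.d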
 
 Use vmstat -m to track pool usage - you should see that ZFS takes 
 more and more memory until 90% of kmem is used in the pools.
 At that point you should see a ~100% busy pgdaemon in top, and
 the pagedaemon patch should log high counts for loops, cnt_starved and 
 cnt_avail while uvm_availmem(false) still reports many free pages.
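 
 One way to watch this while the load runs is to sample the totals of
 vmstat -m in a loop (the 10 second interval is arbitrary):
 
    while true; do
            date
            vmstat -m | grep -E '^(Totals|In use)'  # pool totals / kmem utilization
            sleep 10
    done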
 
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 
 9813.2250709] pagedaemon: loops=16023729, cnt_needsfree=0, 
 cnt_needsscan=0, cnt_drain=16023729, cnt_starved=16023729, 
 cnt_avail=16023729, fpages=336349
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 
 9819.2252810] pagedaemon: loops=16018349, cnt_needsfree=0, 
 cnt_needsscan=0, cnt_drain=16018349, cnt_starved=16018349, 
 cnt_avail=16018349, fpages=336542
 /var/log/messages.0.gz:Jul 28 17:42:41 Marmolata /netbsd: [ 
 9825.2255258] pagedaemon: loops=16025793, cnt_needsfree=0, 
 cnt_needsscan=0, cnt_drain=16025793, cnt_starved=16025793, 
 cnt_avail=16025793, fpages=336516
 ...
 
 That documents the tight loop making no progress. The pgdaemon will not 
 recover - see my analysis.
 Observe that arc_reclaim is not freeing anything (and collects no CPU 
 time, see top) because arc_available_memory claims that there is enough 
 free memory (it looks at uvm_availmem(false)).
 The dtrace probe documents that.
 
 Part 2 - get the arc_reclaim thread to actually be triggered before kmem 
 is starving.
 Install Patch 1 from the bug report. It lets ZFS look at the 
 kmem_arena space situation, which is also what 
 uvm_km.c:uvm_km_va_starved_p(void) looks at.
 Now ZFS has a chance to start reclaiming memory.
 Run the load test again.
 The dtrace probe should now show decreasing memory until it goes 
 negative, and it will stay negative by a certain amount.
 vmstat -m should show that ZFS now only hogs ~75% of kmem.
 There should also be significant counts in the Idle page column, because 
 the arc_reclaim thread did give up memory.
 As the idle pages are not yet reclaimed from the pools, ZFS is asked to 
 free memory all the time (dtrace probe) and vmstat -m will
 show the non-zero Idle page counts. So ZFS now has ~75% of kmem 
 allocated but uses only a small part of it: the cache
 is allocated but not used anymore.
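 
 Such pools can be spotted by filtering vmstat -m for a non-zero Idle count
 (Idle is the last column; the awk filter is just a convenience):
 
    vmstat -m | awk '$1 != "Totals" && $NF ~ /^[0-9]+$/ && $NF > 0 { print $1, $NF }'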
 
 We need to get the Idle pages actually reclaimed from the pools. This is 
 done by Patch 2 from the bug report.
 There is no way to hand this task to the pgdaemon: it only looks at 
 uvm_availmem(false), which does not consider kmem unless it is starving. Also, 
 the pool drain thread drains only one pool per invocation, and it is 
 not even triggered here.
 So Patch 2 directly reclaims from the pool_cache_invalidate()ed pool.
 
 With this strategy ZFS keeps the kmem usage around 75%, as Idle pages 
 are now reclaimed and ZFS only gets negative arc_available_memory
 values when called for.
 vmstat will show that ZFS now stays within the 75% kmem limit. arc_reclaim 
 runs at a suitable rate when needed. The ZFS pools should not show too many 
 idle pages (idle pages
 are removed after some cool-down time to reduce xcall activity, if I read 
 the code right).
 dtrace should show both positive and negative arc_available_memory figures.
 
 I did not keep the vmstat, dtrace and top outputs. But from a busy 
 DB-loading DOMU (databases > 350 GB)
 I see a vmstat -m of
 
 Memory resource pool statistics
 Name        Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
 ...
 zfs_znode_cache 248 215697   0        0 13482     0 13482 13482 0   inf    0
 zil_lwb_cache 208      84    0        0     5     0     5     5 0   inf    0
 zio_buf_1024 1536   11248    0     7612  3278  1460  1818  1818 0   inf    0
 zio_buf_10240 10240  1130    0      723   973   566   407   407 0   inf    0
 zio_buf_114688 114688 351    0      200   339   188   151   151 0   inf    0
 zio_buf_12288 12288  1006    0      714   721   429   292   305 0   inf    0
 zio_buf_131072 131072 3150  89     2176  1841   867   974   974 0   inf    0
 zio_buf_14336 14336   473    0      308   432   267   165   166 0   inf    0
 zio_buf_1536 2048    2060    0     1065   549    51   498   498 0   inf    0
 zio_buf_16384 16384  9672    0      481  9318   127  9191  9191 0   inf    0
 zio_buf_2048 2048    2001    0      826   682    94   588   588 0   inf    0
 zio_buf_20480 20480   461    0      301   428   268   160   160 0   inf    0
 zio_buf_24576 24576   448    0      293   404   249   155   155 0   inf    0
 zio_buf_2560 2560    2319    1      490  1948   119  1829  1829 0   inf    0
 zio_buf_28672 28672   369    0      221   345   197   148   152 0   inf    0
 zio_buf_3072 3072    4163    2      422  3861   120  3741  3741 0   inf    0
 ...
 zio_buf_7168 7168     506    0      292   465   251   214   214 0   inf    0
 zio_buf_8192 8192     724    0      329   635   240   395   395 0   inf    0
 zio_buf_81920 81920   379    0      229   371   221   150   161 0   inf    0
 zio_buf_98304 98304   580    0      421   442   283   159   163 0   inf    0
 zio_cache    992     4707    0        0  1177     0  1177  1177 0   inf    0
 zio_data_buf_10 1536   39    0       33    20    17     3    12 0   inf    0
 zio_data_buf_10 10240   2    0        2     2     2     0     2 0   inf    0
 zio_data_buf_13 131072 488674 0  323782 274996 110104 164892 191800 0   inf    0
 zio_data_buf_15 2048   25    0       19    13    10     3     7 0   inf    0
 zio_data_buf_20 2048   17    0       13     9     7     2     4 0   inf    0
 zio_data_buf_20 20480   1    0        1     1     1     0     1 0   inf    0
 zio_data_buf_25 2560    7    0        6     7     6     1     5 0   inf    0
 ...
 Totals           222323337  98 210229180 1033080 125800 907280
 
 In use 24951773K, total allocated 25255540K; utilization 98.8%
 
 In the unpatched case all 32 GB were allocated.
 
 The arc_reclaim_thread clocked in at 20 CPU seconds - that is ok.
 
 Current dtrace output is:
 dtrace: script 'zfsmem.d' matched 1 probe
 CPU     ID                    FUNCTION:NAME
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
    7    274        none:arc-available_memory mem = 384434176, reason = 2
    1    274        none:arc-available_memory mem = 384434176, reason = 2
 
 The page daemon was never woken up and has 0 CPU seconds in 2 days.
 
 This all looks very much as desired.
 
 Hope this helps.
 
 Best regards,
    Frank
 
 
 Attachment: pd.diff
 
 --- /src/NetBSD/n10/src/sys/uvm/uvm_pdaemon.c	2023-07-29 17:52:46.392362932 +0200
 +++ /src/NetBSD/n10/src/sys/uvm/.#uvm_pdaemon.c.1.133	2023-07-29 14:18:05.000000000 +0200
 @@ -270,11 +270,15 @@
  	/*
  	 * main loop
  	 */
 -
 +/*XXXkd*/ unsigned long cnt_needsfree = 0L, cnt_needsscan = 0, cnt_drain = 0, cnt_starved = 0, cnt_avail = 0, cnt_loops = 0;
 +/*XXXkd*/ time_t ts, last_ts = time_second;
  	for (;;) {
  		bool needsscan, needsfree, kmem_va_starved;
  
 +/*XXXkd*/ cnt_loops++;
 +
  		kmem_va_starved = uvm_km_va_starved_p();
 +/*XXXkd*/ if (kmem_va_starved) cnt_starved++;
  
  		mutex_spin_enter(&uvmpd_lock);
  		if ((uvm_pagedaemon_waiters == 0 || uvmexp.paging > 0) &&
 @@ -311,6 +315,8 @@
  		needsfree = fpages + uvmexp.paging < uvmexp.freetarg;
  		needsscan = needsfree || uvmpdpol_needsscan_p();
  
 +/*XXXkd*/ if (needsfree) cnt_needsfree++;
 +/*XXXkd*/ if (needsscan) cnt_needsscan++;
  		/*
  		 * scan if needed
  		 */
 @@ -328,8 +334,18 @@
  			wakeup(&uvmexp.free);
  			uvm_pagedaemon_waiters = 0;
  			mutex_spin_exit(&uvmpd_lock);
 +/*XXXkd*/		cnt_avail++;
  		}
  
 +/*XXXkd*/	if (needsfree || kmem_va_starved) cnt_drain++;
 +/*XXXkd*/	ts = time_second;
 +/*XXXkd*/	if (ts > last_ts + 5 && cnt_loops > 5 * 10000) {
 +/*XXXkd*/		printf("pagedaemon: loops=%ld, cnt_needsfree=%ld, cnt_needsscan=%ld, cnt_drain=%ld, cnt_starved=%ld, cnt_avail=%ld, fpages=%d\n",
 +/*XXXkd*/		       cnt_loops, cnt_needsfree, cnt_needsscan, cnt_drain, cnt_starved, cnt_avail, fpages);
 +/*XXXkd*/ 		cnt_needsfree = cnt_needsscan = cnt_drain = cnt_starved = cnt_avail = cnt_loops = 0;
 +/*XXXkd*/		last_ts = ts;
 +/*XXXkd*/	}
 +
  		/*
  		 * scan done.  if we don't need free memory, we're done.
  		 */
 
 

