Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1 VM results in kernel thread running away and filesystem hang

To: port-evbarm-maintainer%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost,pjledge%me.com@localhost
Subject: Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1 VM results in kernel thread running away and filesystem hang
From: Brad Spencer <brad%anduin.eldar.org@localhost>
Date: Wed, 27 Jul 2022 23:50:01 +0000 (UTC)

The following reply was made to PR port-evbarm/56944; it has been noted by GNATS.

From: Brad Spencer <brad%anduin.eldar.org@localhost>
To: gnats-bugs%netbsd.org@localhost
Cc: port-evbarm-maintainer%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
        netbsd-bugs%netbsd.org@localhost, pjledge%me.com@localhost
Subject: Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1
 VM results in kernel thread running away and filesystem hang
Date: Wed, 27 Jul 2022 19:47:49 -0400

 Tobias Nygren <tnn%NetBSD.org@localhost> writes:

 > The following reply was made to PR port-evbarm/56944; it has been noted by GNATS.
 >
 > From: Tobias Nygren <tnn%NetBSD.org@localhost>
 > To: gnats-bugs%netbsd.org@localhost
 > Cc: 
 > Subject: Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1
 >  VM results in kernel thread running away and filesystem hang
 > Date: Wed, 27 Jul 2022 19:18:27 +0200
 >
 >  When pagedaemon is spinning it is indicative of a memory pressure
 >  situation that is unresolvable. The interaction between pagedaemon and
 >  zfs is primarily ARC reclamation. Some observations:

 I believe that I can reproduce this problem on demand with a Xen PVH
 amd64 guest.  I have one where a particular zfs receive causes a hang,
 apparently the one being described here, within just a couple of
 minutes.

 I watched the system with top and "vmstat -m" while I performed the zfs
 receive that hangs.  I didn't notice top catching the pagedaemon running
 away before the hang (although that may just be a display issue, since
 top was unable to print once the hang set in), but I did notice that the
 pool named "zio_data_buf_51" of size 1024 (there appear to be two by
 that name) was increasing quite a lot during the receive.  The hang
 happened when that pool hit around 30000 requests.  When the hang
 happens I can still get into ddb on the Xen console of the guest, and a
 ps there shows a ">" next to the pagedaemon process, which I think means
 that it was running.  I should probably mention that this is a 9.99.98
 guest, so not the most recent -current.

 >  1) It doesn't look like we initialise zfs_arc_free_target, unlike FreeBSD.
 >  2) FreeBSD has additional code to check for kva fragmentation which
 >     we do not.
 >  
 >  So it might be worthwhile to experiment with zfs_arc_free_target to
 >  preemptively avoid the situation where the kernel fails to reclaim enough
 >  pages to continue working. Here's a patch for zfs.kmod you could try:

 I tried this patch on the mentioned Xen guest and as far as I can tell
 it did not seem to help the situation I am seeing.  The system is
 running with an otherwise unmodified arc.c file.

 >  --- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	4 May 2022 15:49:55 -0000	1.21
 >  +++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	27 Jul 2022 17:10:16 -0000
 >  @@ -387,7 +387,7 @@ int zfs_arc_grow_retry = 0;
 >   int zfs_arc_shrink_shift = 0;
 >   int zfs_arc_p_min_shift = 0;
 >   uint64_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
 >  -u_int zfs_arc_free_target = 0;
 >  +u_int zfs_arc_free_target = 32 * 1024 * 1024;
 >  
 >   /* Absolute min for arc min / max is 16MB. */
 >   static uint64_t arc_abs_min = 16 << 20;
 >  @@ -3919,6 +3919,14 @@ arc_available_memory(void)
 >   		r = FMR_LOTSFREE;
 >   	}
 >  
 >  +#ifdef __NetBSD__
 >  +	n = PAGESIZE * ((int64_t)freemem - desfree);
 >  +	if (n < lowest) {
 >  +		lowest = n;
 >  +		r = FMR_LOTSFREE;
 >  +	}
 >  +#endif
 >  +
 >  
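
For readers following the quoted patch, the arithmetic of the added check can be tried in isolation. What follows is a minimal userland sketch, not the kernel code: the function names headroom_bytes and update_lowest are hypothetical, and PAGESIZE, freemem and desfree (kernel globals in the patch) are passed in as plain parameters with made-up values. As far as I understand arc_available_memory(), a negative result means memory should be freed, but treat that reading as an assumption.

```c
#include <stdint.h>

/*
 * Hypothetical stand-alone model of the check added by the quoted patch.
 * In the kernel, PAGESIZE, freemem and desfree are globals; here they are
 * parameters so the arithmetic can be exercised outside the kernel.
 */
static int64_t
headroom_bytes(int64_t pagesize, int64_t freemem_pages, int64_t desfree_pages)
{
	/* Same shape as: n = PAGESIZE * ((int64_t)freemem - desfree); */
	return pagesize * (freemem_pages - desfree_pages);
}

/* Keep the smallest headroom seen so far, as the patch does with "lowest". */
static int64_t
update_lowest(int64_t lowest, int64_t n)
{
	return (n < lowest) ? n : lowest;
}
```

With 4096-byte pages, 1000 free pages and a desfree of 1200 pages, headroom_bytes() returns a negative value, which is the case in which the patch records FMR_LOTSFREE as the reason.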

 I should also mention that if I let /etc/daily run on this guest it will
 also hang the system, probably when the core file check or something
 like that runs across the ZFS filesets.  I have not been able to narrow
 down which part of the daily cron run is tripping the hang, but I do
 know that the hang disappears if I comment out /etc/daily from root's
 crontab.  I also see this exact same hang on a fairly new 9.x Xen guest
 that has ZFS filesets on it, with the same "solution" of commenting out
 /etc/daily.

 -- 
 Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org
