
Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1 VM results in kernel thread running away and filesystem hang



Tobias Nygren <tnn%NetBSD.org@localhost> writes:

> The following reply was made to PR port-evbarm/56944; it has been noted by GNATS.
>
> From: Tobias Nygren <tnn%NetBSD.org@localhost>
> To: gnats-bugs%netbsd.org@localhost
> Cc: 
> Subject: Re: port-evbarm/56944: ZFS heavy usage on NetBSD running in Mac M1
>  VM results in kernel thread running away and filesystem hang
> Date: Wed, 27 Jul 2022 19:18:27 +0200
>
>  When pagedaemon is spinning it is indicative of a memory pressure
>  situation that is unresolvable. The interaction between pagedaemon and
>  zfs is primarily ARC reclamation. Some observations:

I believe I can reproduce this problem on demand with a Xen PVH amd64
guest.  I have one where a particular zfs receive triggers, within a
couple of minutes, a hang that appears to match what is being
described here.

I watched the system with top and "vmstat -m" while I performed the
zfs receive that hangs.  top never caught the pagedaemon running away
before the hang (although that may just be a display issue, with top
unable to refresh once the system hung), but I did notice that the
pool named "zio_data_buf_51", which has a size of 1024 (there appear
to be two pools by that name), was growing quite a lot during the
receive.  The hang happened when that pool hit around 30000 requests.
When the hang happens I can get into ddb on the guest's Xen console,
and a ps there shows a ">" next to the pagedaemon process, which I
believe means it was running.  I should mention that this is a
9.99.98 guest, so not the most recent -current.
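
For context on what those pools are: zio_init() in the ZFS code
creates one buffer cache per 512-byte size step, named
"zio_buf_<size>" and "zio_data_buf_<size>" (the latter holding file
data).  The little userland sketch below is my reconstruction of just
the naming scheme, not the kernel code itself; the point is that
there is one cache per buffer size, so a single pool growing without
bound points at I/O buffers of one particular size.

#include <stdio.h>

#define SPA_MINBLOCKSHIFT 9	/* ZFS sizes its zio caches in 512-byte steps */

int
main(void)
{
	char name[32];

	/* Print the first few cache names as zio_init() would create them. */
	for (int c = 1; c <= 8; c++) {
		unsigned long size = (unsigned long)c << SPA_MINBLOCKSHIFT;
		snprintf(name, sizeof(name), "zio_data_buf_%lu", size);
		printf("%s\n", name);
	}
	return 0;
}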

>  1) It doesn't look like we initialise zfs_arc_free_target, unlike FreeBSD.
>  2) FreeBSD has additional code to check for kva fragmentation which
>     we do not.
>  
>  So it might be worthwhile to experiment with zfs_arc_free_target to
>  preemptively avoid the situation where the kernel fails to reclaim enough
>  pages to continue working. Here's a patch for zfs.kmod you could try:

I tried this patch on the mentioned Xen guest and, as far as I can
tell, it did not help the situation I am seeing.  Apart from the
patch, the system is running an unmodified arc.c.

>  --- external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	4 May 2022 15:49:55 -0000	1.21
>  +++ external/cddl/osnet/dist/uts/common/fs/zfs/arc.c	27 Jul 2022 17:10:16 -0000
>  @@ -387,7 +387,7 @@ int zfs_arc_grow_retry = 0;
>   int zfs_arc_shrink_shift = 0;
>   int zfs_arc_p_min_shift = 0;
>   uint64_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */
>  -u_int zfs_arc_free_target = 0;
>  +u_int zfs_arc_free_target = 32 * 1024 * 1024;
>  
>   /* Absolute min for arc min / max is 16MB. */
>   static uint64_t arc_abs_min = 16 << 20;
>  @@ -3919,6 +3919,14 @@ arc_available_memory(void)
>   		r = FMR_LOTSFREE;
>   	}
>  
>  +#ifdef __NetBSD__
>  +	n = PAGESIZE * ((int64_t)freemem - desfree);
>  +	if (n < lowest) {
>  +		lowest = n;
>  +		r = FMR_LOTSFREE;
>  +	}
>  +#endif
>  +
>  
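
Something that might be worth trying instead: if I remember right,
FreeBSD seeds zfs_arc_free_target at boot from the pageout daemon's
wakeup threshold via an arc_free_target_init() SYSINIT, rather than
hard-coding a value.  A NetBSD analogue might look like the untested
sketch below, using uvmexp.freetarg (the pagedaemon's free page
target, in pages); the placement in arc_init() is my guess:

#ifdef __NetBSD__
#include <uvm/uvm_extern.h>	/* uvmexp.freetarg */

/*
 * Untested sketch, not part of the patch above: seed
 * zfs_arc_free_target from the pagedaemon's own free target instead
 * of a fixed constant, similar in spirit to what FreeBSD does.
 */
static void
arc_free_target_init(void)
{
	if (zfs_arc_free_target == 0)
		zfs_arc_free_target = uvmexp.freetarg;
}
#endif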


I should also mention that if I let /etc/daily run on this guest it
will also hang the system, probably when the core file check or
something like it runs across the ZFS file set.  I have not been able
to narrow down exactly which part of the daily cron run is tripping
the hang, but I do know that the hang disappears if I comment
/etc/daily out of root's crontab.  I also see this exact same hang on
a fairly new 9.x Xen guest that has ZFS filesets on it, with the same
"solution" of commenting out /etc/daily.



-- 
Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org

