kern/58198: ZFS can lead to UVM kills (no swap, out of swap)

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/58198: ZFS can lead to UVM kills (no swap, out of swap)
From: kardel%netbsd.org@localhost
Date: Fri, 26 Apr 2024 14:50:00 +0000 (UTC)

>Number:         58198
>Category:       kern
>Synopsis:       ZFS can lead to UVM kills (no swap, out of swap)
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Fri Apr 26 14:50:00 +0000 2024
>Originator:     kardel%netbsd.org@localhost
>Release:        NetBSD 10.0_STABLE
>Organization:
	
>Environment:
	
	
System: NetBSD AlpineTest.acrys.com 10.0_STABLE NetBSD 10.0_STABLE (GENERIC) #2: Thu Apr 25 19:55:18 CEST 2024 kardel%gaia.acrys.com@localhost:/src/NetBSD/n10/src/obj.amd64/sys/arch/amd64/compile/GENERIC amd64
Architecture: x86_64
Machine: amd64
>Description:
	Using ZFS can lead to processes being killed by UVM because the system
	is out of swap or has no swap space configured.

	This has been observed on systems with a sufficiently large memory footprint
	(e.g. 240GB) on a Xen DOMU (HVM) running a GENERIC kernel.
	The use case is, for example, a parallel (8 times) load of a database.
	The out of swap / no swap space kill happens around the time
	when ZFS should, at the latest, start releasing pool memory.

	The hypothesis is that, even though enough physical memory is
	available, ZFS eventually hogs most pool memory and does not free
	up pool memory resources *in time* before the pagedaemon decides it
	needs to swap. At that point swap is required, or the process is killed
	when no swap is available.

	With swap available, processing continues, ZFS frees some of
	its pool memory, and everything carries on.

	So, although there is no real resource shortage that would require swap,
	the ZFS / pagedaemon / swap / UVM interaction is at best suboptimal.

	There should be no need to have swap space available when running
	a database (or any other writing process) on ZFS, as ZFS can always
	evict data to storage.

>How-To-Repeat:
	Set up a Xen DOMU with significant memory and no swap. Create a database on ZFS
	and load a larger database backup with a higher degree of parallelism (e.g. 8).
	Sit back and watch ZFS consume pool memory. Once almost all pool memory is consumed,
	some unlucky processes may be UVM killed.

	Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 26090.6694 (java), uid 1802 killed: out of swap
	Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 1944.1944 (postgres), uid 1003 killed: out of swap
	Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 23508.26256 (java), uid 1802 killed: out of swap
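
	When reproducing this, it can help to tally which processes were hit.
	The following is only a minimal sketch in Python, not part of the report's
	tooling: it assumes the kill messages have exactly the shape of the three
	log lines above, and that the second number after "pid" is the LWP id.
	The function names parse_uvm_kills and kills_per_command are hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the kill messages quoted in this report.
# Assumption: the field after the dot in "pid 26090.6694" is the LWP id.
UVM_KILL = re.compile(
    r"UVM: pid (?P<pid>\d+)\.(?P<lwp>\d+) \((?P<comm>[^)]*)\), "
    r"uid (?P<uid>\d+) killed: (?P<reason>.+)$"
)


def parse_uvm_kills(lines):
    """Return one dict per UVM kill message found in the given log lines."""
    kills = []
    for line in lines:
        match = UVM_KILL.search(line.rstrip("\n"))
        if match is None:
            continue  # not a UVM kill message
        kills.append({
            "pid": int(match.group("pid")),
            "lwp": int(match.group("lwp")),
            "comm": match.group("comm"),
            "uid": int(match.group("uid")),
            "reason": match.group("reason"),
        })
    return kills


def kills_per_command(kills):
    """Count how many kills hit each command name."""
    return Counter(kill["comm"] for kill in kills)


# The three log lines from this report, used as sample input.
SAMPLE = [
    "Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 26090.6694 (java), uid 1802 killed: out of swap",
    "Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 1944.1944 (postgres), uid 1003 killed: out of swap",
    "Apr 22 09:32:02 Alpine-next /netbsd: [ 8871047.9934402] UVM: pid 23508.26256 (java), uid 1802 killed: out of swap",
]

if __name__ == "__main__":
    for command, count in kills_per_command(parse_uvm_kills(SAMPLE)).items():
        print(command, count)
```

	In practice the same functions could be fed the lines of the system log
	file instead of SAMPLE; the log location is site specific and not stated here.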

>Fix:
	Rework the ZFS/pagedaemon communication so that overshooting into swap usage before ZFS manages to free memory cannot happen.
	This might require better coordination with ZFS, as the pooldrain mechanism is currently too asynchronous:
	further memory requests arriving just after pool draining has been triggered lead to swap usage.

	Alternatively ZFS could be limited to not consume so much memory.

	May be related to PR kern/57558 (there ZFS does not free resources though it could)

>Unformatted:
