

uvm/busy page deadlock in current (related to loading Raspberry Pi 3B+ Wi-Fi firmware, but more of a timing bug with the VM system)



Hi.

I spent last weekend -- and a few days this week -- tracking down a problem that exists in current.  I found a workaround, but I don't know what the "proper" fix is.  Digging through the VM layer and debugging with printfs was slow -- and it's a boot-time issue, so I had to swap a lot of SD cards back and forth :-).  Hopefully someone here is better at this than me.


Here's the background/setup:

	- I'm using a Raspberry Pi 3B+ in aarch64 mode

	- I'm building my own kernel, but that kernel is more or less the GENERIC64 kernel

	- I've added the "brcmfmac43455-sdio.[bin,txt]" files so that I can use the built-in Wi-Fi

	- I'm using a root file system on a RAM disk (I build a root file system image and then put it on the SD card, and then use efiboot/bootaa64 to load the RAM disk)

		(I don't know if that's relevant, but I wondered if using a RAM disk might have timing implications)

Here's the symptom:

	- After most of the boot up, but before the login prompt is presented, the system hangs.  Can't break into DDB either.

Here's what's actually going wrong:

	- The if_bwfm_sdio driver is loading the firmware, using the interfaces in dev/firmload.

	- The firmware for this device is pretty large (depending on which one you use [another, separate problem I hope is resolved soon :-)], between 450K and 600K)

	- firmware_read calls vn_rdwr with a large request

	- that calls VOP_READ, which calls ffs_read, which then does this:

		- ubc_uiomove
			- ubc_alloc
				- breaks the request up into 64K chunks
				- completes OK
			- uiomove
				- memcpy 64K
					==> generates a fault

	- the first fault then gets handled:

		- data_abort_handler
			- uvm_fault_internal
				- ubc_fault
					- uvn_get
						- uvm_ra_request
							- NOTE: it's the first time for this file, so no read ahead is
								actually posted, but because we got "no advice", it sets 
								one up for next time
							- completes
						- VOP_GETPAGES: note it's called with the SYNCIO flag
							- genfs_getpages (called with "async" set to false)
								uvm_findpages
									- uvm_pagealloc_strat for many new pages
									- returns all the pages, now marked busy
								genfs_getpages_read( async is false )
									- md_read (I'm using a RAM disk), called synchronously
										completes without error
									completes without error
								for each page
									- calls ubc_fault_page
										- marks each page as NOT BUSY
								completes without error
							completes without error
						completes without error
					completes without error
				completes without error

Everything's OK so far.  But we're not done.

But ubc_uiomove still has many more 64K chunks to copy, and on the next one we run into the problem.
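
(For reference, the chunking loop is in ubc_uiomove() in sys/uvm/uvm_bio.c.  This is a rough sketch paraphrased from memory, not a verbatim copy of the source:)

	int
	ubc_uiomove(struct uvm_object *uobj, struct uio *uio, vsize_t todo,
	    int advice, int flags)
	{
		void *win;
		vsize_t bytelen;
		int error = 0;

		while (todo > 0) {
			bytelen = todo;
			/* map at most one UBC window (64K here) of the object */
			win = ubc_alloc(uobj, uio->uio_offset, &bytelen, advice, flags);
			/* copying out of the window is what generates the fault */
			error = uiomove(win, bytelen, uio);
			ubc_release(win, flags);
			if (error)
				break;
			todo -= bytelen;
		}
		return error;
	}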

When uiomove calls memcpy for the second chunk, the fault goes like this:

	- data_abort_handler
		- uvm_fault_internal
			- ubc_fault
				- uvn_get
					- uvm_ra_request
						- the previous fault set up a read-ahead request, so now we post it
							- and it gets posted from where we currently are (offset = 64K)
						- ra_startio
							- VOP_GETPAGES: but THIS TIME it's called WITHOUT the SYNCIO flag, which means
								- genfs_getpages: called with "async" set to TRUE
									- uvm_findpages:
										- uvm_pagealloc_strat for many new pages
										- returns all the pages, now marked busy
									- genfs_getpages_read( async is TRUE)
										- BUT WAIT!
										**** uvm.aiodone_queue is NULL -- IT HAS NOT BEEN STARTED YET ****
										**** the "async" flag gets reset to false for this routine ****
										md_read is called synchronously
											- completes without error
										- completes without error
								- completes without error:
									BUT:
										**** ALL THESE PAGES ARE STILL MARKED BUSY ****
								- completes without error
							- completes without error

					next, uvn_get calls:

					- VOP_GETPAGES: this time, it's called with the SYNCIO flag
						- genfs_getpages (called with "async" set to false)
							uvm_findpages
								- finds the first page at 64K
								- BUT THE PAGE IS MARKED BUSY (no one has marked them "un-busy" yet -- and no one will)
								- decides to wait
								- it will wait forever, because the previous call is already complete, and
									there's no aiodone queue processing anyway



Given all that, and the fact that I'm by no means an expert in any of this code (and the last thing I'd want to muck up is the VM layer), here are some work-around ideas.  But I'd certainly love someone to provide a real fix.

1) The SIMPLEST, easiest-to-apply workaround is to change firmload to pass "IO_ADV_ENCODE(POSIX_FADV_RANDOM)" to vn_rdwr.  This forces the lower layer to avoid posting read-ahead requests, which prevents the problem from happening.  But it's REALLY hacky as a way to avoid the problem, and it leaves the underlying bug there for something else to hit later.

	--- a/sys/dev/firmload.c
	+++ b/sys/dev/firmload.c
	@@ -317,7 +317,7 @@ firmware_read(firmware_handle_t fh, off_t offset, void *buf, size_t len)
	 {
 
	        return (vn_rdwr(UIO_READ, fh->fh_vp, buf, len, offset,
	-                       UIO_SYSSPACE, 0, kauth_cred_get(), NULL, curlwp));
	+                       UIO_SYSSPACE, IO_ADV_ENCODE(POSIX_FADV_RANDOM), kauth_cred_get(), NULL, curlwp));
	 }
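
As far as I can tell, the reason this works is that uvm_ra_request() bails out before setting anything up when the advice is "random", so ra_startio() never gets called.  I believe the check near the top of uvm_ra_request() in sys/uvm/uvm_readahead.c looks something like this (paraphrased, not verbatim):

	if (ra == NULL || advice == UVM_ADV_RANDOM) {
		return;
	}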



2) Change the read-ahead code to short-circuit if uvm.aiodone_queue is NULL.  This also leaves an underlying bug in place, but it still might be the right thing to do.  If uvm.aiodone_queue is NULL, then the read-ahead code just issues page-in requests which are handled synchronously by the lower layer (genfs_getpages_read switches to sync mode on its own), so there doesn't seem to be any performance benefit being given up -- the "read-ahead" just happens synchronously before any actual read does.  A rough sketch of what I mean is below.
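
Concretely, I'm picturing something like the following at the top of uvm_ra_request() -- untested, and the exact placement is just my guess (maybe ra_startio() is the better spot):

	/* in sys/uvm/uvm_readahead.c, at the start of uvm_ra_request() */
	if (uvm.aiodone_queue == NULL) {
		/*
		 * The aiodone worker isn't running yet, so any "async"
		 * VOP_GETPAGES posted from ra_startio() would be done
		 * synchronously by genfs_getpages_read() anyway, and its
		 * pages would never be marked un-busy.  Skip read-ahead.
		 */
		return;
	}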

3) Start "aiodone_queue" earlier in the sequence.  I don't have a rich enough understanding of this part of the kernel and user land startup process to know how hard this is, or how hacky it is.

BTW, I'm ASSUMING that if uvm.aiodone_queue were present, the asynchronous completion would somehow handle marking the pages as "not busy".  But I actually never debugged that code path, so I can't be sure that's helpful.

Of course, someone who understands the VM layer can likely dig in and figure out what the RIGHT fix is.  It seems to me that the fact that genfs_getpages_read decides to operate synchronously on its own might need to be detected by the other layers, since they currently behave as if their async requests will be handled asynchronously -- and they're not.  But honestly, I never figured out whether that would change things -- I never debugged the case where the requests actually WERE being handled asynchronously.  A sketch of one such idea follows.
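
Just to make the shape of that concrete (untested, and I'm assuming uvm_aio_aiodone() is the completion routine the aiodone worker would normally run -- I never debugged the async path): when genfs_getpages_read() downgrades an async request to a synchronous read because uvm.aiodone_queue is NULL, it could run that completion itself:

	/* hypothetical, in genfs_getpages_read(), after a forced-sync read completes */
	if (caller_wanted_async) {	/* hypothetical flag: we were asked for async I/O */
		/*
		 * We did the I/O synchronously because there's no aiodone
		 * worker yet.  Run the same completion it would have run,
		 * so the pages get un-busied instead of staying busy forever.
		 */
		uvm_aio_aiodone(bp);
	}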

I also don't know why others aren't seeing this -- although maybe they are.  I don't know how many people are trying out Wi-Fi on the 3B+.  Without the firmware files in the tree, maybe no one's really doing much with them yet.

Hope this is helpful.

Rob



