NetBSD-Bugs archive


kern/59557: deadlock in ure(4), sysctl(9), and suspend/resume



>Number:         59557
>Category:       kern
>Synopsis:       deadlock in ure(4), sysctl(9), and suspend/resume
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Jul 26 02:05:00 +0000 2025
>Originator:     Taylor R Campbell
>Release:        10
>Organization:
The DeadlockUSB Foundation
>Environment:
>Description:

	1. I had a ure(4) device plugged in.
	2. I suspended my laptop.
	3. I unplugged the ure(4) device.
	4. I resumed my laptop.

	At this point, various processes wedged.  New processes all
	hung waiting for sysctl_treelock.  The sysctl_treelock was held
	as a _writer_ by the `sysctl -w hw.acpi.sleep.state=3' process,
	which in turn was waiting in:

		sleepq_block
		cv_wait
		usbd_transfer
		usbd_do_request_len
		usbd_do_request
		ure_ctl.isra.0
		ure_uno_mii_write_reg
		mii_phy_reset
		rgephy_reset
		mii_phy_resume
		device_pmf_driver_resume
		pmf_device_resume
		pmf_system_resume
		acpi_enter_sleep_state
		sysctl_hw_acpi_sleepstate
		sysctl_dispatch
		sys___sysctl
		syscall

	Two USB event threads were stuck in:

		sleepq_block
		turnstile_block
		rw_vector_enter
		sysctl_teardown
		ubt_detach
		config_detach
		usb_disconnect_port
		uhub_explore
		usb_discover
		usb_event_thread

		sleepq_block
		turnstile_block
		mutex_vector_enter
		usbnet_stop
		usbnet_detach
		config_detach
		usb_disconnect_port
		uhub_explore
		usb_discover
		usb_event_thread

	One of the USB task threads was stuck in:

		sleepq_block
		turnstile_block
		mutex_vector_enter
		usbnet_tick_task
		usb_task_thread

	Relevant parts of autoconf device tree:

	    xhci0
	      usb0
	        uhub0
	          umass0
	            scsibus0
	              sd0
	          umass1
	            scsibus1
	              sd1
	          ure0
	            rgephy0
	      usb1
	        uhub1
	          uhidev0
	            uhid0
	          ugenif0
	          ubt0
	          uvideo0
	            video0
	          ugen0

	(I also have xhci1 with usb2->uhub2 and usb3->uhub3, but there
	are no USB devices on uhub2 or uhub3.)


>How-To-Repeat:

	1. plug in ure(4)
	2. suspend
	3. unplug ure(4)
	4. resume


>Fix:

	First, it's not clear to me why a write lock must be taken on
	sysctl_treelock when we're only writing to a sysctl node, but
	not modifying the tree:
	https://nxr.netbsd.org/xref/src/sys/kern/kern_sysctl.c?r=1.271#312
	This may not be the cause of the deadlock, but it is the cause
	of various processes wedging even though they're not doing
	anything with ure(4).

	The USB event thread with a stack trace through ubt_detach
	appears to be a red herring, blocked by the real deadlock that
	happens to occur while holding sysctl_treelock write-locked.
	Similarly, the USB task thread with a stack trace through
	usbnet_tick_task appears to be collateral damage.

	The part I can't explain yet is this:

	=> pmf_system_resume resumes xhci0 first, then ure0, then
	   rgephy0 (omitting intermediate nodes), which calls
	   mii_phy_resume, which takes the mii lock (struct
	   usbnet_private::unp_miilock) and then waits for a USB
	   transfer -- which will never complete because the device is
	   gone, but it _also_ isn't being aborted.

	Then, once xhci0 is resumed and delivers an interrupt to report
	that a device has been disconnected, the USB
	event thread tries to config_detach(ure0) which does
	usbnet_detach -> usbnet_stop -> mutex_enter(unp->unp_miilock).
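
	The interaction described above can be sketched as a small
	userland model with POSIX threads.  This is an assumption-laden
	illustration, not the usbnet(9) code: `miilock' stands in for
	unp_miilock, the condition variable stands in for the wait
	inside usbd_transfer, and all function names are hypothetical.
	One thread takes the lock and then waits for a completion; the
	other thread, playing the role of the detach path, finds the
	lock unavailable.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t miilock = PTHREAD_MUTEX_INITIALIZER;	/* models unp_miilock */
static pthread_mutex_t xfer_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t xfer_cv = PTHREAD_COND_INITIALIZER;
static bool xfer_started, xfer_done;

/* Models the resume path: take the mii lock, then wait for a transfer. */
static void *
resume_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&miilock);
	pthread_mutex_lock(&xfer_lock);
	xfer_started = true;
	pthread_cond_broadcast(&xfer_cv);
	while (!xfer_done)		/* waits with miilock still held */
		pthread_cond_wait(&xfer_cv, &xfer_lock);
	pthread_mutex_unlock(&xfer_lock);
	pthread_mutex_unlock(&miilock);
	return NULL;
}

/*
 * Models the detach path: once the resume thread is waiting, try to
 * take the mii lock.  Returns the trylock result.
 */
static int
detach_probe(void)
{
	pthread_t t;
	int error, result;

	xfer_started = xfer_done = false;
	error = pthread_create(&t, NULL, resume_thread, NULL);
	assert(error == 0);

	/* Wait until the resume thread holds miilock and has queued up. */
	pthread_mutex_lock(&xfer_lock);
	while (!xfer_started)
		pthread_cond_wait(&xfer_cv, &xfer_lock);
	pthread_mutex_unlock(&xfer_lock);

	/* A blocking lock here would hang; trylock reports instead. */
	result = pthread_mutex_trylock(&miilock);
	if (result == 0)
		pthread_mutex_unlock(&miilock);

	/* Unlike the reported hang, complete the transfer so we unwind. */
	pthread_mutex_lock(&xfer_lock);
	xfer_done = true;
	pthread_cond_broadcast(&xfer_cv);
	pthread_mutex_unlock(&xfer_lock);
	pthread_join(t, NULL);
	return result;
}
```

	Here detach_probe() returns EBUSY: as long as the waiting
	thread is never woken, nothing that needs the same lock can make
	progress.  The model only unwinds because it completes the
	transfer itself, which is exactly what did not happen in the
	report.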

	I anticipated this problem in usb_subr.c rev. 1.270 back in
	2022 when I changed usb_disconnect_port to call
	usbd_suspend_pipe before calling config_detach -- this causes
	any subsequent usbd_transfer calls to fail with USBD_CANCELLED
	by setting pipe->up_aborting = true, and calls the bus's
	upm_abort method for every transfer already queued.
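
	The intended behaviour can be modelled in userland as follows.
	This is a sketch with POSIX threads under stated assumptions,
	not the usbdi(9) implementation: `struct model_pipe', the
	XFER_* status names, and every function below are hypothetical.
	The `aborting' flag plays the role of pipe->up_aborting, and
	the broadcast plays the role of aborting transfers that are
	already queued.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

enum xfer_status { XFER_DONE, XFER_CANCELLED };	/* hypothetical names */

struct model_pipe {
	pthread_mutex_t lock;
	pthread_cond_t cv;
	bool aborting;		/* models pipe->up_aborting */
	bool completed;		/* never set here: the device is gone */
	int nwaiting;		/* transfers currently waiting */
};

static void
model_pipe_init(struct model_pipe *p)
{
	int error;

	error = pthread_mutex_init(&p->lock, NULL);
	assert(error == 0);
	error = pthread_cond_init(&p->cv, NULL);
	assert(error == 0);
	p->aborting = p->completed = false;
	p->nwaiting = 0;
}

/* Wait for completion, or fail at once if the pipe is aborting. */
static enum xfer_status
model_transfer(struct model_pipe *p)
{
	enum xfer_status status;

	pthread_mutex_lock(&p->lock);
	p->nwaiting++;
	pthread_cond_broadcast(&p->cv);	/* announce that we are queued */
	while (!p->aborting && !p->completed)
		pthread_cond_wait(&p->cv, &p->lock);
	status = p->completed ? XFER_DONE : XFER_CANCELLED;
	p->nwaiting--;
	pthread_mutex_unlock(&p->lock);
	return status;
}

/* Mark the pipe as aborting and wake every queued transfer. */
static void
model_suspend_pipe(struct model_pipe *p)
{
	pthread_mutex_lock(&p->lock);
	p->aborting = true;
	pthread_cond_broadcast(&p->cv);
	pthread_mutex_unlock(&p->lock);
}

static void *
xfer_thread(void *arg)
{
	return (void *)(intptr_t)model_transfer(arg);
}

/* A transfer that is already waiting gets woken and cancelled. */
static enum xfer_status
abort_queued_transfer(void)
{
	struct model_pipe p;
	pthread_t t;
	void *ret;
	int error;

	model_pipe_init(&p);
	error = pthread_create(&t, NULL, xfer_thread, &p);
	assert(error == 0);
	pthread_mutex_lock(&p.lock);
	while (p.nwaiting == 0)		/* wait until it is queued */
		pthread_cond_wait(&p.cv, &p.lock);
	pthread_mutex_unlock(&p.lock);
	model_suspend_pipe(&p);
	pthread_join(t, &ret);
	return (enum xfer_status)(intptr_t)ret;
}

/* A transfer submitted after the abort fails without waiting. */
static enum xfer_status
transfer_after_abort(void)
{
	struct model_pipe p;

	model_pipe_init(&p);
	model_suspend_pipe(&p);
	return model_transfer(&p);
}
```

	In the model both abort_queued_transfer() and
	transfer_after_abort() return XFER_CANCELLED.  The model wakes
	its waiter only because the waiter and the aborting thread
	share the same flag and condition variable; whether the
	corresponding wakeup reaches the transfer in ure_ctl is the
	open question.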

	So that _should_ have caused the transfer in ure_ctl to wake up
	and fail.  But why didn't it?  I don't know!



