NetBSD-Bugs archive
kern/55485: bwfm @ sdmmc hangs, with some diagnosis
>Number: 55485
>Category: kern
>Synopsis: bwfm @ sdmmc hangs, with some diagnosis
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Jul 13 10:15:00 +0000 2020
>Originator: matthew green
>Release: NetBSD 9.99.69
>Organization:
people's front against (bozotic) www (softwar foundation)
>Environment:
System: NetBSD hinotori.eterna23.net 9.99.68 NetBSD 9.99.68 (_hinotori_) #14: Sun Jul 12 22:34:00 PDT 2020 mrg%yesterday-when-i-was-mad.eterna23.net@localhost:/var/obj/evbarm-aarch64/usr/src2/sys/arch/evbarm/compile/_hinotori_ evbarm
Architecture: aarch64
Machine: evbarm
>Description:
we've been tracking hangs in bwfm(4) on the pinebookpro for a
while; most of the time there is no useful information, but this
time i've captured some relevant data.
first, a NET_MPSAFE kernel doesn't hard-hang when the problem
occurs. this is because the paths into the bwfm code that hang
aren't holding kernel_lock at that point, so only the specific
network interface hangs.
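(for reference, that is a kernel built with the standard config option:)

```
options 	NET_MPSAFE
```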
second, in the reproduction i just had, i found that the
hung 'ifconfig -a' command was stuck waiting for the bwfm
adaptive mutex in bwfm_sdio_txctl() with this trace:
fp ffffc00088d07780 mi_switch() at ffffc000004abecc netbsd:mi_switch+0x20c
fp ffffc00088d077e0 sleepq_block() at ffffc000004a88bc netbsd:sleepq_block+0x9c
fp ffffc00088d07820 turnstile_block() at ffffc000004ba0bc netbsd:turnstile_block+0x30c
fp ffffc00088d07890 mutex_enter() at ffffc0000048f58c netbsd:mutex_enter+0x1cc
fp ffffc00088d07910 bwfm_sdio_txctl() at ffffc000000ee9c0 netbsd:bwfm_sdio_txctl+0x60
fp ffffc00088d07940 bwfm_proto_bcdc_query_dcmd() at ffffc000001ea7d4 netbsd:bwfm_proto_bcdc_query_dcmd+0xd4
fp ffffc00088d079a0 bwfm_fwvar_var_get_data() at ffffc000001ec61c netbsd:bwfm_fwvar_var_get_data+0x9c
fp ffffc00088d07a00 bwfm_get_sta_info() at ffffc000001ed5c8 netbsd:bwfm_get_sta_info+0x58
fp ffffc00088d07b60 bwfm_ioctl() at ffffc000001ed808 netbsd:bwfm_ioctl+0x178
fp ffffc00088d07ba0 doifioctl() at ffffc0000057e778 netbsd:doifioctl+0x8f8
fp ffffc00088d07d40 sys_ioctl() at ffffc000004ee770 netbsd:sys_ioctl+0x420
fp ffffc00088d07e20 syscall() at ffffc0000008da2c netbsd:syscall+0x18c
tf ffffc00088d07ed0 el0_trap() at ffffc0000008c524 netbsd:el0_trap
this lock is:
* Lock 0 (initialized at bwfm_sdio_attach)
lock address : 0xffffc00001b2d180 type : sleep/adaptive
initialized : 0xffffc000000ef710
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 1
relevant cpu : 5 last held: 5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000000efb00 unlocked : 0xffffc000000efb98
owner field : 0xffff0000f687b080 wait/spin: 1/0
Turnstile:
=> 0 waiting readers:
=> 1 waiting writers: 0xffff00002af94b00
it is held by the sdmmc task thread for bwfm:
fp ffffc00088008b00 splx() at ffffc000000041b0 netbsd:splx+0x80
fp ffffc00088008b30 callout_halt() at ffffc000004b9444 netbsd:callout_halt+0x54
fp ffffc00088008b60 sleepq_block() at ffffc000004a8940 netbsd:sleepq_block+0x120
fp ffffc00088008ba0 cv_timedwait() at ffffc0000046d71c netbsd:cv_timedwait+0x11c
fp ffffc00088008be0 cv_timedwaitbt() at ffffc0000046da74 netbsd:cv_timedwaitbt+0x74
fp ffffc00088008c20 dwc_mmc_exec_command() at ffffc000001e5ac4 netbsd:dwc_mmc_exec_command+0x4b4
fp ffffc00088008ca0 sdmmc_mmc_command() at ffffc000000e80d0 netbsd:sdmmc_mmc_command+0x40
fp ffffc00088008cd0 sdmmc_io_rw_extended() at ffffc000000e9374 netbsd:sdmmc_io_rw_extended+0xa4
fp ffffc00088008d90 sdmmc_io_read_4() at ffffc000000e9d74 netbsd:sdmmc_io_read_4+0x24
fp ffffc00088008db0 bwfm_sdio_task() at ffffc000000efb24 netbsd:bwfm_sdio_task+0x54
fp ffffc00088008e90 sdmmc_task_thread() at ffffc000000e8a00 netbsd:sdmmc_task_thread+0xb0
this thread also holds this lock (which seems unrelated):
* Lock 1 (initialized at dwc_mmc_init)
lock address : 0xffff0000f68f0250 type : sleep/adaptive
initialized : 0xffffc000001e6108
shared holds : 0 exclusive: 1
shares wanted: 0 exclusive: 0
relevant cpu : 5 last held: 5
relevant lwp : 0xffff0000f687b080 last held: 0xffff0000f687b080
last locked* : 0xffffc000001e5634 unlocked : 0xffffc000001e5cbc
owner field : 0xffff0000f687b080 wait/spin: 0/0
Turnstile: no active turnstile for this lock.
this task thread is stuck in a loop waiting for a condition:
574 static void
575 dwc_mmc_exec_command(sdmmc_chipset_handle_t sch, struct sdmmc_command *cmd)
...
722 struct bintime timeout = { .sec = 15, .frac = 0 };
723 const struct bintime epsilon = { .sec = 1, .frac = 0 };
724 while (!ISSET(cmd->c_flags, SCF_ITSDONE)) {
725 error = cv_timedwaitbt(&sc->sc_intr_cv,
726 &sc->sc_intr_lock, &timeout, &epsilon);
727 if (error != 0) {
728 cmd->c_error = error;
729 SET(cmd->c_flags, SCF_ITSDONE);
730 mutex_exit(&sc->sc_intr_lock);
731 goto done;
732 }
733 }
this appears to be a loop that should exit after 15 seconds if
the condition fails to come true, but it isn't exiting. while it
fails to exit, both the dwc_mmc_init() mutex and the
bwfm_sdio_attach() mutex remain held, and bwfm hangs.
as i was attempting to look at the in-memory contents of the
timeout and epsilon values on the stack, i lost ddb and had to
power cycle.
>How-To-Repeat:
use bwfm. use a lossy network for faster hangs.
>Fix: