Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)

To: Taylor R Campbell <campbell+netbsd-tech-kern%mumble.net@localhost>
Subject: Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)
From: PHO <pho%cielonegro.org@localhost>
Date: Sat, 22 Jul 2023 21:52:40 +0900

On 7/22/23 20:48, Taylor R Campbell wrote:

A note about one of the problems there:

	spin_unlock(f->lock);
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	spin_lock(f->lock);
#if defined(__NetBSD__)
	/* This is probably an upstream bug: there is a time window between
	 * the call of vmw_fence_obj_signaled() above, and this
	 * dma_fence_add_callback(). If the fence gets signaled during it
	 * dma_fence_add_callback() returns -ENOENT, which is really not an
	 * error condition. By the way, why the heck does dma_fence work in
	 * this way? If a callback is being added but it lost the race, why
	 * not just call it immediately as if it were just signaled?
	 */

Not an upstream bug -- I introduced this bug when I patched the code
that reached into the guts of what should be an opaque data structure
for direct modification, to use dma_fence_add_callback instead.

Need to look at the diff from upstream, not just the #ifdefs.  Usually
I use #ifdef __NetBSD__ to mark NetBSDisms separately from Linuxisms,
and just patch the code when the patched code can use a common API
that isn't any one OSism.

In this case I don't even remember why I left any #ifdefs, was
probably just working fast to make progress on a large code base,
might have left the #ifdefs in for visual reference while I was
editing the code and forgot to remove them.  Could also simplify some
of the lock/unlock cycles by doing that.

Ah okay. I used #if defined(__NetBSD__) for everything needing any
changes, and I assumed you did the same without actually checking the
original code.
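
For readers following along, the return convention discussed above can be
sketched in userland C. This is only a toy model, not the vmwgfx or
linux_dma_fence code: struct toy_fence and every toy_* function are
hypothetical names made up for illustration, and there is no locking. It
assumes only the convention mentioned in the comment, namely that adding a
callback to an already-signaled fence returns -ENOENT, and shows a caller
that folds that case into success instead of treating it as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy fence; NOT the real struct dma_fence. */
struct toy_fence {
	bool signaled;
	void (*cb)(struct toy_fence *, void *);
	void *cb_arg;
};

/*
 * Models the return convention only: 0 if the callback was
 * registered, -ENOENT if the fence had already been signaled.
 */
static int
toy_fence_add_callback(struct toy_fence *f,
    void (*cb)(struct toy_fence *, void *), void *arg)
{
	assert(f != NULL && cb != NULL);
	if (f->signaled)
		return -ENOENT;
	f->cb = cb;
	f->cb_arg = arg;
	return 0;
}

/* Mark the fence signaled and run the registered callback once, if any. */
static void
toy_fence_signal(struct toy_fence *f)
{
	f->signaled = true;
	if (f->cb != NULL) {
		void (*cb)(struct toy_fence *, void *) = f->cb;

		f->cb = NULL;
		cb(f, f->cb_arg);
	}
}

/* Callback: record that the waiter has been woken. */
static void
toy_wake(struct toy_fence *f, void *arg)
{
	(void)f;
	*(bool *)arg = true;
}

/*
 * Caller side: -ENOENT means "lost the race, already signaled",
 * so it is treated as an immediate wakeup rather than a failure.
 */
static int
toy_wait_setup(struct toy_fence *f, bool *woken)
{
	int ret = toy_fence_add_callback(f, toy_wake, woken);

	if (ret == -ENOENT) {
		*woken = true;
		return 0;
	}
	return ret;
}
```

With this shape the waiter only has to look at its own woken flag,
whichever side won the race. In real kernel code the flag would of course
have to be protected by the fence lock.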

    cv_destroy(&cv); // <-- Panics!

It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but not
always. The panic seems to happen when cv_timedwait_sig() exits due to
the timeout expiring before it gets signaled.


Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more
frequent panic than KASSERT(!cv_has_waiters(cv))?

I meant that in most cases it didn't panic, as if nothing were wrong,
but it occasionally panicked due to KASSERT(!cv_has_waiters(cv)). Sorry
for my bad English.

What exactly is the panic you see and the evidence when you see it?
Stack trace, gdb print cb in crash dump?

Wait, can we use gdb for examining the kernel dump? I thought gdb
couldn't read it. Here's the stack trace found in /var/log/messages:

Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3652130] panic: kernel diagnostic assertion "!cv_has_waiters(cv)" failed: file "/home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c", line 108
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3782663] cpu0: Begin traceback...
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5355447] vpanic() at netbsd:vpanic+0x173
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5454410] kern_assert() at netbsd:kern_assert+0x4b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5551143] cv_destroy() at netbsd:cv_destroy+0x8a
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_fence_wait_timeout() at netbsd:linux_dma_fence_wait_timeout+0x8b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_resv_wait_timeout_rcu() at netbsd:linux_dma_resv_wait_timeout_rcu+0xbe
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] ttm_bo_wait() at netbsd:ttm_bo_wait+0x4c
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_resource_unbind_list() at netbsd:vmw_resource_unbind_list+0x103
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_move_notify() at netbsd:vmw_move_notify+0x16
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6351198] ttm_bo_handle_move_mem() at netbsd:ttm_bo_handle_move_mem+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_mem_evict_first() at netbsd:ttm_mem_evict_first+0x702
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_mem_space() at netbsd:ttm_bo_mem_space+0x21e
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_validate() at netbsd:ttm_bo_validate+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate_single() at netbsd:vmw_validation_bo_validate_single+0x93
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate() at netbsd:vmw_validation_bo_validate+0xaa
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_execbuf_process() at netbsd:vmw_execbuf_process+0x771
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] vmw_execbuf_ioctl() at netbsd:vmw_execbuf_ioctl+0x97
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] drm_ioctl() at netbsd:drm_ioctl+0x23d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] drm_ioctl_shim() at netbsd:drm_ioctl_shim+0x25
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] sys_ioctl() at netbsd:sys_ioctl+0x56d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] syscall() at netbsd:syscall+0x196
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] --- syscall (number 54) ---
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] netbsd:syscall+0x196:
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] cpu0: End traceback...
