tech-kern archive


Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)



> Date: Mon, 17 Jul 2023 12:57:42 +0900
> From: PHO <pho%cielonegro.org@localhost>
> 
> While further testing my DRM/KMS vmwgfx driver patches by playing 
> games/minetest from pkgsrc, I experienced inexplicable kernel panics on 
> this code:
> 
> https://github.com/depressed-pho/netbsd-src/blob/91daa67f17222da355d3fddd6fa849c786d9c545/sys/external/bsd/drm2/dist/drm/vmwgfx/vmwgfx_fence.c#L289

A note about one of the problems there:

	spin_unlock(f->lock);
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	spin_lock(f->lock);
#if defined(__NetBSD__)
	/* This is probably an upstream bug: there is a time window between
	 * the call of vmw_fence_obj_signaled() above, and this
	 * dma_fence_add_callback(). If the fence gets signaled during it
	 * dma_fence_add_callback() returns -ENOENT, which is really not an
	 * error condition. By the way, why the heck does dma_fence work in
	 * this way? If a callback is being added but it lost the race, why
	 * not just call it immediately as if it were just signaled?
	 */

Not an upstream bug -- I introduced this bug when I patched the
upstream code, which reached into the guts of what should be an opaque
data structure and modified it directly, to use dma_fence_add_callback
instead.
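
The fix is basically just to treat -ENOENT as `already signalled'.
Roughly like this -- untested sketch, assuming the surrounding function
has the usual dma_fence_ops .wait shape, i.e. a `timeout' argument and
an `out' label on the unlock-and-return path:

	spin_unlock(f->lock);
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	spin_lock(f->lock);
	if (ret == -ENOENT) {
		/*
		 * The fence was signalled between the earlier
		 * vmw_fence_obj_signaled() check and the callback
		 * being attached.  Not an error: report the full
		 * timeout as remaining and skip the wait loop.
		 */
		ret = timeout;
		goto out;
	} else if (ret) {
		/* Some other failure from dma_fence_add_callback(). */
		goto out;
	}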

Need to look at the diff from upstream, not just the #ifdefs.  Usually
I use #ifdef __NetBSD__ to mark NetBSDisms separately from Linuxisms,
and just patch the code outright when it can use a common API that
isn't specific to any one OS.

In this case I don't even remember why I left any #ifdefs.  I was
probably just working fast to make progress on a large code base, and
may have left the #ifdefs in for visual reference while editing the
code and then forgotten to remove them.  Dropping them would also let
us simplify some of the lock/unlock cycles (sketch below).
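
For example, since dma_fence_add_callback() takes f->lock itself, the
callback could be attached before the lock is taken at all instead of
unlocking and relocking around it.  Sketch only, assuming nothing in
that stretch actually needs the lock held; wait_one_tick() is a
made-up stand-in for however the function really sleeps
(schedule_timeout() upstream, the cv_timedwait_sig()-based path in the
NetBSD patch):

	/* Attach the callback unlocked -- dma_fence_add_callback()
	 * does its own locking -- and handle -ENOENT as above. */
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	if (ret == -ENOENT) {
		ret = timeout;		/* already signalled: done */
		goto out;
	} else if (ret) {
		goto out;
	}

	/* Take the fence lock only for the wait loop itself. */
	spin_lock(f->lock);
	while (!dma_fence_is_signaled_locked(f) && timeout > 0) {
		spin_unlock(f->lock);
		timeout = wait_one_tick(timeout);	/* made-up helper */
		spin_lock(f->lock);
	}
	spin_unlock(f->lock);

	/* On timeout, dma_fence_remove_callback(f, &cb.base) -- again
	 * without holding the lock -- before returning. */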

>    cv_destroy(&cv); // <-- Panics!
> 
> It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but not 
> always. The panic seems to happen when cv_timedwait_sig() exits due to 
> the timeout expiring before it gets signaled.

Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more
frequent panic than KASSERT(!cv_has_waiters(cv))?

What exactly is the panic you see, and what is the evidence when you
see it?  A stack trace, `print cb' from gdb on the crash dump?
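
For reference, cv_destroy(9)'s contract is that nothing may still be
waiting on (or in the middle of waking) the cv when it is destroyed;
under DIAGNOSTIC that is exactly what the KASSERT checks.  A minimal
illustration of the pattern with an on-stack cv and a timed wait --
every name in it is made up for the example, it is not the driver
code:

	#include <sys/param.h>
	#include <sys/condvar.h>
	#include <sys/kernel.h>		/* hz */
	#include <sys/mutex.h>

	/* Both owned elsewhere; made up for the example. */
	static kmutex_t example_lock;	/* MUTEX_DEFAULT, IPL_NONE */
	static bool example_done;	/* set under example_lock before
					 * the cv_broadcast() */

	static int
	example_timed_wait(void)
	{
		kcondvar_t cv;
		int error = 0;

		cv_init(&cv, "example");

		mutex_enter(&example_lock);
		while (!example_done && error == 0)
			error = cv_timedwait_sig(&cv, &example_lock,
			    3 * hz);
		mutex_exit(&example_lock);

		/*
		 * Only safe once no other thread can still be sleeping
		 * on cv or about to cv_broadcast() it.  If the wait
		 * timed out but some callback can still signal the cv,
		 * that has to be cut off first; destroying a cv with a
		 * waiter still queued is what the DIAGNOSTIC
		 * KASSERT(!cv_has_waiters(cv)) catches.
		 */
		cv_destroy(&cv);

		return error;
	}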

