tech-kern archive


Re: Strange crash of DIAGNOSTIC kernel on cv_destroy(9)



On 7/22/23 20:48, Taylor R Campbell wrote:

A note about one of the problems there:

	spin_unlock(f->lock);
	ret = dma_fence_add_callback(f, &cb.base, vmwgfx_wait_cb);
	spin_lock(f->lock);
#if defined(__NetBSD__)
	/* This is probably an upstream bug: there is a time window between
	 * the call of vmw_fence_obj_signaled() above, and this
	 * dma_fence_add_callback(). If the fence gets signaled during it
	 * dma_fence_add_callback() returns -ENOENT, which is really not an
	 * error condition. By the way, why the heck does dma_fence work in
	 * this way? If a callback is being added but it lost the race, why
	 * not just call it immediately as if it were just signaled?
	 */

Not an upstream bug -- I introduced this bug when I patched the code
that reached into the guts of what should be an opaque data structure
for direct modification, to use dma_fence_add_callback instead.

Need to look at the diff from upstream, not just the #ifdefs.  Usually
I use #ifdef __NetBSD__ to mark NetBSDisms separately from Linuxisms,
and just patch the code when the patched code can use a common API
that isn't any one OSism.

In this case I don't even remember why I left any #ifdefs, was
probably just working fast to make progress on a large code base,
might have left the #ifdefs in for visual reference while I was
editing the code and forgot to remove them.  Could also simplify some
of the lock/unlock cycles by doing that.

Ah okay. I used #if defined(__NetBSD__) for everything needing any changes, and I assumed you did the same without actually checking the original code.
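
For reference, the -ENOENT case discussed above can be sketched in userspace. This is only an illustration under stated assumptions: every mock_* name below is a hypothetical stand-in, not the real dma_fence API, and all locking is left out. The only behaviour taken from the thread is that adding a callback to an already-signaled fence returns -ENOENT, and that a waiter should treat this as "already signaled" rather than as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the real dma_fence API. */
struct mock_fence;
struct mock_cb;
typedef void (*mock_cb_fn)(struct mock_fence *, struct mock_cb *);

struct mock_cb {
	mock_cb_fn func;
	bool fired;
};

struct mock_fence {
	bool signaled;
	struct mock_cb *cb;	/* at most one callback, for brevity */
};

/* Mimics the contract from the thread: -ENOENT if already signaled. */
static int
mock_fence_add_callback(struct mock_fence *f, struct mock_cb *cb,
    mock_cb_fn func)
{
	assert(f != NULL && cb != NULL && func != NULL);
	if (f->signaled)
		return -ENOENT;	/* lost the race; cb is not registered */
	cb->func = func;
	cb->fired = false;
	f->cb = cb;
	return 0;
}

/* Marks the fence signaled and runs the registered callback, if any. */
static void
mock_fence_signal(struct mock_fence *f)
{
	struct mock_cb *cb = f->cb;

	f->signaled = true;
	f->cb = NULL;
	if (cb != NULL)
		cb->func(f, cb);
}

/* Callback that only records that it ran. */
static void
mock_wait_cb(struct mock_fence *f, struct mock_cb *cb)
{
	(void)f;
	cb->fired = true;
}

/*
 * Returns 1 if the fence was already signaled (nothing to wait for),
 * 0 if the callback was registered and the caller must wait, or a
 * negative errno for any other failure.
 */
static int
mock_prepare_wait(struct mock_fence *f, struct mock_cb *cb)
{
	int ret = mock_fence_add_callback(f, cb, mock_wait_cb);

	if (ret == -ENOENT)
		return 1;	/* already signaled: not an error */
	return ret;
}
```

A caller would skip the wait entirely when mock_prepare_wait() returns 1, and otherwise sleep until the callback has fired.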

    cv_destroy(&cv); // <-- Panics!

It seldom panics on KASSERT(!cv_has_waiters(cv)) in cv_destroy() but not
always. The panic seems to happen when cv_timedwait_sig() exits due to
the timeout expiring before it gets signaled.

Confused by `seldom panics on ... but not always' -- was that supposed
to be `often panics on ... but not always', or is there a more
frequent panic than KASSERT(!cv_has_waiters(cv))?

I meant that in most cases it didn't panic, as if nothing were wrong, but it occasionally panicked on KASSERT(!cv_has_waiters(cv)). Sorry for my bad English.

What exactly is the panic you see and the evidence when you see it?
Stack trace, gdb print cb in crash dump?

Wait, can we use gdb to examine the kernel dump? I thought gdb couldn't read it. Here's the stack trace found in /var/log/messages:

Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3652130] panic: kernel diagnostic assertion "!cv_has_waiters(cv)" failed: file "/home/pho/sandbox/_netbsd/src/sys/kern/kern_condvar.c", line 108
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.3782663] cpu0: Begin traceback...
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5355447] vpanic() at netbsd:vpanic+0x173
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5454410] kern_assert() at netbsd:kern_assert+0x4b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.5551143] cv_destroy() at netbsd:cv_destroy+0x8a
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] vmw_fence_wait() at netbsd:vmw_fence_wait+0xdc
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_fence_wait_timeout() at netbsd:linux_dma_fence_wait_timeout+0x8b
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6151161] linux_dma_resv_wait_timeout_rcu() at netbsd:linux_dma_resv_wait_timeout_rcu+0xbe
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] ttm_bo_wait() at netbsd:ttm_bo_wait+0x4c
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_resource_unbind_list() at netbsd:vmw_resource_unbind_list+0x103
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6251241] vmw_move_notify() at netbsd:vmw_move_notify+0x16
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6351198] ttm_bo_handle_move_mem() at netbsd:ttm_bo_handle_move_mem+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_mem_evict_first() at netbsd:ttm_mem_evict_first+0x702
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_mem_space() at netbsd:ttm_bo_mem_space+0x21e
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6451175] ttm_bo_validate() at netbsd:ttm_bo_validate+0xe6
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate_single() at netbsd:vmw_validation_bo_validate_single+0x93
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_validation_bo_validate() at netbsd:vmw_validation_bo_validate+0xaa
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6551178] vmw_execbuf_process() at netbsd:vmw_execbuf_process+0x771
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] vmw_execbuf_ioctl() at netbsd:vmw_execbuf_ioctl+0x97
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6651177] drm_ioctl() at netbsd:drm_ioctl+0x23d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] drm_ioctl_shim() at netbsd:drm_ioctl_shim+0x25
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] sys_ioctl() at netbsd:sys_ioctl+0x56d
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6751234] syscall() at netbsd:syscall+0x196
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] --- syscall (number 54) ---
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] netbsd:syscall+0x196:
Jul 17 00:52:34 netbsd-current /netbsd: [ 64017.6851255] cpu0: End traceback...
