NetBSD-Bugs archive


Re: port-amd64/39283: Kernel crash on Dell Poweredge 2950



Hello,

Tobias Nygren <tnn%NetBSD.org@localhost> wrote:
>  It tripped over again. Backtrace is similar to before but not identical.
>  Looks like lock recursion now (notice the bnx interrupt).
>  Would it be possible (and safe?) to return immediately without doing any
>  work if mutex_owned()?

Now this is a locking bug.  Do you mean using mutex_owned() to make locking
decisions?  In that case, no: it would be very wrong, and it would also not
work on spin mutexes.

>  panic: lock error
>  cpu_Debugger() at netbsd:cpu_Debugger+0x9
>  panic() at netbsd:panic+0x1f6
>  lockdebug_abort() at netbsd:lockdebug_abort+0x8f
>  mutex_abort() at netbsd:mutex_abort+0x29
>  mutex_vector_enter() at netbsd:mutex_vector_enter+0x1c4
>  pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x23
>  pool_reclaim() at netbsd:pool_reclaim+0x69
>  pool_reclaim_callback() at netbsd:pool_reclaim_callback+0x41
>  callback_run_roundrobin() at netbsd:callback_run_roundrobin+0x100
>  ...

From the backtrace, it seems there are three paths competing for the same
thing, basically: reclaim on the VA cache of kmem_map (since more layers are
involved, like the vmem quantum cache, it goes through the pool subsystem a
couple of times).  The following interrupt (the 3rd path) arrives while
reclaiming, tries to reclaim again from interrupt context, and probably locks
against itself ("lock error" would be meaningful with LOCKDEBUG, in this case):

> bnx_intr() at netbsd:bnx_intr+0xf1
> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
> Xintr_ioapic_level1() at netbsd:Xintr_ioapic_level1+0xf4
> --- interrupt ---
> mutex_enter() at netbsd:mutex_enter+0x11
> pool_reclaim() at netbsd:pool_reclaim+0x69
> pool_reclaim_callback() at netbsd:pool_reclaim_callback+0x41

This is a bit confusing.  Since kmem_map is VM_MAP_INTRSAFE, the pool should
be interrupt-safe too, i.e. run at IPL_VM, and that mutex should be a spin
lock, blocking bnx_intr() as it runs at IPL_NET (== IPL_VM).
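
To illustrate the expectation above, here is a rough userland sketch of the
idea, not kernel code.  Everything prefixed with toy_/TOY_ is made up for the
example (including the numeric levels), and it models a single CPU only.  It
assumes a spin mutex raises the priority level to its own IPL on entry and
that an interrupt at or below the current level is deferred until the level
drops again:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy priority levels; the values are illustrative, not NetBSD's real ones. */
enum { TOY_IPL_NONE = 0, TOY_IPL_VM = 5, TOY_IPL_NET = TOY_IPL_VM, TOY_NIPL = 8 };

struct toy_spin_mutex {
	int  ipl;        /* level the mutex raises the toy CPU to */
	int  saved_ipl;  /* level to restore on exit */
	bool held;
};

static int  toy_cur_ipl = TOY_IPL_NONE;  /* current level of the one toy CPU */
static bool toy_pending[TOY_NIPL];       /* deferred interrupts, by level */
static int  toy_ran_while_held;          /* handler runs that saw the lock held */
static int  toy_ran_total;               /* all handler runs */

/* Stand-in for an interrupt handler that would want the same mutex. */
static void
toy_handler(const struct toy_spin_mutex *m)
{
	if (m->held)
		toy_ran_while_held++;  /* this would be the self-deadlock case */
	toy_ran_total++;
}

/* Deliver at once if the level is above the current IPL, otherwise defer. */
static void
toy_interrupt(const struct toy_spin_mutex *m, int level)
{
	if (level > toy_cur_ipl)
		toy_handler(m);
	else
		toy_pending[level] = true;
}

static void
toy_spin_enter(struct toy_spin_mutex *m)
{
	assert(!m->held);
	m->saved_ipl = toy_cur_ipl;
	if (m->ipl > toy_cur_ipl)
		toy_cur_ipl = m->ipl;  /* block interrupts at or below m->ipl */
	m->held = true;
}

static void
toy_spin_exit(struct toy_spin_mutex *m)
{
	m->held = false;
	toy_cur_ipl = m->saved_ipl;
	/* Run whatever was deferred and is no longer blocked. */
	for (int l = TOY_NIPL - 1; l > toy_cur_ipl; l--) {
		if (toy_pending[l]) {
			toy_pending[l] = false;
			toy_handler(m);
		}
	}
}

/* Returns how many times the handler ran with the mutex held. */
static int
toy_demo(void)
{
	struct toy_spin_mutex m = { .ipl = TOY_IPL_VM, .saved_ipl = 0, .held = false };

	toy_spin_enter(&m);
	toy_interrupt(&m, TOY_IPL_NET);  /* arrives while the lock is held */
	toy_spin_exit(&m);
	return toy_ran_while_held;
}
```

In this model toy_demo() never sees the handler run with the mutex held,
which is why an interrupt landing inside pool_reclaim() looks wrong if the
mutex there really is a spin mutex at IPL_VM.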

Unfortunately, I have not had time yet to figure out more, but I can add some
KASSERT()s if you are OK with crashing the machine a little bit more? :)

-- 
Mindaugas

