Re: port-amd64/39283: Kernel crash on Dell Poweredge 2950

To: Tobias Nygren <tnn%NetBSD.org@localhost>
Subject: Re: port-amd64/39283: Kernel crash on Dell Poweredge 2950
From: Mindaugas Rasiukevicius <rmind%netbsd.org@localhost>
Date: Mon, 14 Dec 2009 20:54:02 +0000

Hello,

Tobias Nygren <tnn%NetBSD.org@localhost> wrote:
>  It tripped over again. Backtrace is similar to before but not identical.
>  Looks like lock recursion now (notice the bnx interrupt).
>  Would it be possible (and safe?) to return immediately without doing any
>  work if mutex_owned()?

Now this is a locking bug.  Do you mean using mutex_owned() to make locking
decisions?  In that case, no: it would be very wrong, and it would also not
work on spin mutexes.
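
To illustrate the spin-mutex point, here is a minimal userspace sketch. It
assumes, per my reading of mutex(9), that mutex_owned() on a spin mutex only
reports that the mutex is held, not who holds it. The names spin_model,
model_owned() and reclaim_with_shortcut() are hypothetical and are not kernel
APIs.

```c
#include <stdbool.h>

/* Toy model of a spin mutex: it records only that it is held, not by whom. */
struct spin_model {
    bool held;
};

/* Models mutex_owned() on a spin mutex: true if anyone holds it (assumption). */
static bool model_owned(const struct spin_model *m)
{
    return m->held;
}

/*
 * The proposed shortcut: skip the work if the lock looks owned.
 * Returns 1 if the work was done, 0 if it was skipped.
 */
int reclaim_with_shortcut(struct spin_model *m)
{
    if (model_owned(m))
        return 0;       /* skipped, even if the holder is another CPU */
    m->held = true;     /* stands in for mutex_enter() */
    /* ... reclaim work would go here ... */
    m->held = false;    /* stands in for mutex_exit() */
    return 1;
}
```

In this model a caller that does not hold the lock still skips the work
whenever some other CPU holds it, which is not the behaviour the shortcut was
meant to have.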

>  panic: lock error
>  cpu_Debugger() at netbsd:cpu_Debugger+0x9
>  panic() at netbsd:panic+0x1f6
>  lockdebug_abort() at netbsd:lockdebug_abort+0x8f
>  mutex_abort() at netbsd:mutex_abort+0x29
>  mutex_vector_enter() at netbsd:mutex_vector_enter+0x1c4
>  pool_cache_invalidate() at netbsd:pool_cache_invalidate+0x23
>  pool_reclaim() at netbsd:pool_reclaim+0x69
>  pool_reclaim_callback() at netbsd:pool_reclaim_callback+0x41
>  callback_run_roundrobin() at netbsd:callback_run_roundrobin+0x100
>  ...

From the backtrace, it seems there are three paths competing for the same
thing, namely reclaim on the VA cache of kmem_map (since more layers are
involved, like the vmem quantum cache, it goes through the pool subsystem a
couple of times).  The following interrupt (the 3rd path) happens while
reclaiming, tries to reclaim again from interrupt context, and probably
locks against itself ("lock error" would be meaningful with LOCKDEBUG, in
this case):

> bnx_intr() at netbsd:bnx_intr+0xf1
> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
> Xintr_ioapic_level1() at netbsd:Xintr_ioapic_level1+0xf4
> --- interrupt ---
> mutex_enter() at netbsd:mutex_enter+0x11
> pool_reclaim() at netbsd:pool_reclaim+0x69
> pool_reclaim_callback() at netbsd:pool_reclaim_callback+0x41

This is a bit confusing.  Since kmem_map is VM_MAP_INTRSAFE, the pool should
be interrupt-safe too, i.e. run at IPL_VM, and that mutex should be a spin
lock, blocking bnx_intr() as it runs at IPL_NET (== IPL_VM).
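
The expectation above can be written down as a toy model. The numeric levels
and the names MODEL_IPL_* and intr_is_blocked() are hypothetical; only the
ordering and the equality IPL_NET == IPL_VM are taken from the text.

```c
/* Hypothetical levels; only the ordering and NET == VM matter here. */
enum {
    MODEL_IPL_NONE = 0,
    MODEL_IPL_VM   = 6,
    MODEL_IPL_NET  = MODEL_IPL_VM,
    MODEL_IPL_HIGH = 9
};

/*
 * An interrupt is held off while the CPU runs at or above its level.
 * Returns 1 if the interrupt is blocked, 0 if it can be delivered.
 */
int intr_is_blocked(int cur_ipl, int intr_ipl)
{
    return intr_ipl <= cur_ipl;
}
```

Under this model a spin mutex held at IPL_VM keeps an IPL_NET interrupt from
being delivered, so an interrupt arriving inside the locked region suggests
the lock was not taken at the expected level.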

Unfortunately, I have not had time yet to figure out more, but I can add some
KASSERT()s if you are OK with crashing the machine a little bit more? :)

-- 
Mindaugas
