Re: kern/49462: if_slowtimo callout mangled?

To: gnats-bugs%netbsd.org@localhost
Subject: Re: kern/49462: if_slowtimo callout mangled?
From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
Date: Thu, 11 Dec 2014 10:51:06 +0900

On Thu, Dec 11, 2014 at 2:35 AM,  <martin%netbsd.org@localhost> wrote:
>>Number:         49462
>>Category:       kern
>>Synopsis:       if_slowtimo callout mangled?
>>Confidential:   no
>>Severity:       critical
>>Priority:       high
>>Responsible:    kern-bug-people
>>State:          open
>>Class:          sw-bug
>>Submitter-Id:   net
>>Arrival-Date:   Wed Dec 10 17:35:00 +0000 2014
>>Originator:     Martin Husemann
>>Release:        NetBSD 7.99.2
>>Organization:
> The NetBSD Foundation, Inc.
>>Environment:
> System: NetBSD thirdstage.duskware.de 7.99.2 NetBSD 7.99.2 (MODULAR) #237: Wed Dec 10 17:53:26 CET 2014 martin%thirdstage.duskware.de@localhost:/usr/src/sys/arch/sparc64/compile/MODULAR sparc64
> Architecture: sparc64
> Machine: sparc64
>>Description:
>
> I get a (not always, but pretty often) reproducible KASSERT firing on reboot
> on a SMP sparc64 machine with four bge interfaces:
>
> panic: kernel diagnostic assertion "(c->c_flags & CALLOUT_PENDING) == 0" failed: file "../../../../kern/kern_timeout.c", line 314 callout 0x103b29aa8: c_func (0x119f100) c_flags (0x3) destroyed from 0x11a0e40
>
> The function is if_slowtimo and the caller of callout_destroy() is
> if_detach.
>
> However, this is basically impossible:
>
>          if (ifp->if_slowtimo != NULL) {
>                  callout_halt(ifp->if_slowtimo_ch, NULL);
>                  callout_destroy(ifp->if_slowtimo_ch);
>                  kmem_free(ifp->if_slowtimo_ch, sizeof(*ifp->if_slowtimo_ch));
>          }
>
> and callout_halt() certainly kills bit 0 in flags (CALLOUT_BOUND), and should
> also not return before CALLOUT_PENDING has cleared.
>
> So something(tm) is wrong, but it is not obvious to me.
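The flag word in the panic can be decoded mechanically. The report says bit 0 is CALLOUT_BOUND, and the CALLOUT_PENDING assertion fired with c_flags 0x3, so CALLOUT_PENDING must be the remaining bit, 0x2. The user-space C sketch below hard-codes those two values (deduced from the report, not copied from sys/callout.h); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Values deduced from the report: bit 0 is CALLOUT_BOUND, and the failed
 * PENDING assertion on 0x3 leaves bit 1 for CALLOUT_PENDING. */
#define CALLOUT_BOUND   0x0001
#define CALLOUT_PENDING 0x0002

/* Hypothetical helper mirroring the KASSERT in callout_destroy():
 * returns 1 if a callout with these flags may be destroyed. */
static int
destroy_allowed(unsigned int flags)
{
	return (flags & CALLOUT_PENDING) == 0;
}

/* Hypothetical helper: print the known flag names set in 'flags'. */
static void
print_flags(unsigned int flags)
{
	printf("0x%x:%s%s\n", flags,
	    (flags & CALLOUT_BOUND) ? " BOUND" : "",
	    (flags & CALLOUT_PENDING) ? " PENDING" : "");
}
```

With these values, print_flags(0x3) prints "0x3: BOUND PENDING", and destroy_allowed(0x3) is 0, which is exactly the assertion that fired.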

It can happen. See PR 47881 and this thread:
http://mail-index.netbsd.org/netbsd-bugs/2014/11/12/msg039065.html .

The problem is that callout_halt waits for a running callout handler to
complete, but does not prevent that handler from scheduling the callout
again.
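The interleaving can be reproduced in a single-threaded toy model. This is a sketch, not kernel code: struct toy_callout and the toy_* functions are hypothetical stand-ins, and "waiting for the in-flight handler" is modelled by simply running it to completion. The handler re-arms itself in the style of a periodic timer like if_slowtimo.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as deduced from the panic (bit 0 BOUND, bit 1 PENDING). */
#define CALLOUT_BOUND   0x0001
#define CALLOUT_PENDING 0x0002

/* Toy stand-in for a kernel callout; not the real callout_t. */
struct toy_callout {
	unsigned int flags;
	void (*func)(void *);
	void *arg;
};

/* Models callout_schedule(): (re)arms the callout. */
static void
toy_schedule(struct toy_callout *c)
{
	c->flags |= CALLOUT_PENDING;
}

/* A periodic handler that re-arms itself at the end of each run. */
static void
rearming_handler(void *arg)
{
	struct toy_callout *c = arg;

	/* ... periodic work would go here ... */
	toy_schedule(c);
}

/*
 * Models callout_halt() called while the handler is already running on
 * another CPU: the pending entry is cancelled, then callout_halt waits
 * for the in-flight handler, modelled here by running it to completion.
 * Nothing stops that handler from re-arming the callout.
 */
static void
toy_halt_while_running(struct toy_callout *c)
{
	c->flags &= ~CALLOUT_PENDING;
	c->func(c->arg);
}

/* Mirrors the KASSERT in callout_destroy(). */
static bool
toy_destroy_ok(const struct toy_callout *c)
{
	return (c->flags & CALLOUT_PENDING) == 0;
}
```

Starting from flags = CALLOUT_BOUND (handler running, nothing queued), the halt returns with flags 0x3 and toy_destroy_ok() is false, matching the panic. The model does not try to reproduce how the real kernel manages CALLOUT_BOUND.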

I fixed that problem by reusing an existing field of the callout's user
(in6m->in6m_timer) as a "don't callout_schedule anymore" flag for the
callout. I think the same fix can be applied to this case.
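A minimal single-threaded sketch of that approach follows. The owner structure and its dying field are hypothetical stand-ins for in6m and in6m->in6m_timer; the point is only the ordering: set the flag first, then halt, and have the handler check the flag before re-arming. In real SMP code the flag must be written and read under a lock the handler also takes, which this model does not show.

```c
#include <assert.h>
#include <stdbool.h>

#define CALLOUT_PENDING 0x0002

/* Toy stand-in for a kernel callout; not the real callout_t. */
struct toy_callout {
	unsigned int flags;
	void (*func)(void *);
	void *arg;
};

/* Hypothetical owner of the callout (e.g. an interface or in6_multi). */
struct toy_owner {
	struct toy_callout ch;
	bool dying;	/* hypothetical "don't callout_schedule anymore" flag */
};

static void
toy_schedule(struct toy_callout *c)
{
	c->flags |= CALLOUT_PENDING;
}

/* Periodic handler that re-arms only while the owner is alive. */
static void
guarded_handler(void *arg)
{
	struct toy_owner *o = arg;

	/* ... periodic work ... */
	if (!o->dying)
		toy_schedule(&o->ch);
}

/* Models callout_halt() racing with a running handler: cancel the
 * pending entry, then let the in-flight handler run to completion. */
static void
toy_halt_while_running(struct toy_callout *c)
{
	c->flags &= ~CALLOUT_PENDING;
	c->func(c->arg);
}

/* Detach path: set the flag before halting; returns true if the
 * callout is no longer pending, i.e. destroying it would be safe. */
static bool
toy_owner_detach(struct toy_owner *o)
{
	o->dying = true;
	toy_halt_while_running(&o->ch);
	return (o->ch.flags & CALLOUT_PENDING) == 0;
}
```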

Nonetheless, I think we should probably handle this in callout_halt itself.
For example, we could introduce a CALLOUT_HALTING flag that callout_halt
sets before waiting for a running handler to finish, and have
callout_schedule check the flag first and do nothing if it is set. That
would prevent a new callout from being scheduled during callout_halt.
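A toy model of the proposal, under stated assumptions: CALLOUT_HALTING does not exist yet, so its bit value below is arbitrary and for this model only, and the toy_* names are hypothetical. Halting sets the flag before "waiting" for the running handler, and the modelled callout_schedule ignores re-arm attempts while it is set.

```c
#include <assert.h>
#include <stdbool.h>

#define CALLOUT_PENDING 0x0002
#define CALLOUT_HALTING 0x0010	/* proposed flag; bit chosen arbitrarily here */

/* Toy stand-in for a kernel callout; not the real callout_t. */
struct toy_callout {
	unsigned int flags;
	void (*func)(void *);
	void *arg;
};

/* Proposed callout_schedule(): a no-op while a halt is in progress. */
static void
toy_schedule(struct toy_callout *c)
{
	if (c->flags & CALLOUT_HALTING)
		return;
	c->flags |= CALLOUT_PENDING;
}

/* A periodic handler that tries to re-arm itself on every run. */
static void
rearming_handler(void *arg)
{
	toy_schedule(arg);
}

/* Proposed callout_halt(): mark HALTING before waiting for the running
 * handler (modelled by running it), so its re-arm attempt is dropped. */
static void
toy_halt_while_running(struct toy_callout *c)
{
	c->flags |= CALLOUT_HALTING;
	c->flags &= ~CALLOUT_PENDING;
	c->func(c->arg);
	c->flags &= ~CALLOUT_HALTING;	/* later explicit schedules work again */
}
```

Clearing CALLOUT_HALTING at the end keeps the "halt, then reschedule later" usage working, which matches the proposal's scope (blocking scheduling only during callout_halt). Leaving it set until callout_destroy would be a stricter alternative.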

Off topic: callout_schedule_locked takes a held mutex, but only releases it
just before returning. We could release the mutex in the caller, after
callout_schedule_locked returns, so that we don't need to pass it at all.
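The two shapes can be compared with a toy lock that only tracks whether it is held. The function names and argument order below are hypothetical and are not the actual kern_timeout.c signatures; the sketch only illustrates moving the release out of the callee.

```c
#include <assert.h>
#include <stdbool.h>

#define CALLOUT_PENDING 0x0002

/* Toy lock that only records whether it is held. */
struct toy_lock {
	bool held;
};

static void
toy_lock_enter(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void
toy_lock_exit(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Toy stand-in for a kernel callout; not the real callout_t. */
struct toy_callout {
	unsigned int flags;
	int ticks;
};

/* Current shape: called with the lock held, releases it before returning. */
static void
schedule_locked_old(struct toy_callout *c, int ticks, struct toy_lock *lock)
{
	assert(lock->held);
	c->ticks = ticks;
	c->flags |= CALLOUT_PENDING;
	toy_lock_exit(lock);
}

/* Suggested shape: never touches the lock, so it needs no lock argument. */
static void
schedule_locked_new(struct toy_callout *c, int ticks)
{
	c->ticks = ticks;
	c->flags |= CALLOUT_PENDING;
}

static void
schedule_old(struct toy_callout *c, int ticks, struct toy_lock *lock)
{
	toy_lock_enter(lock);
	schedule_locked_old(c, ticks, lock);	/* returns with the lock released */
}

static void
schedule_new(struct toy_callout *c, int ticks, struct toy_lock *lock)
{
	toy_lock_enter(lock);
	schedule_locked_new(c, ticks);
	toy_lock_exit(lock);	/* the caller releases the lock */
}
```

Both paths leave the callout in the same state; the difference is that in the second one the acquire and release sit in the same function, which is easier to audit.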

  ozaki-r

>
>>How-To-Repeat:
> Reboot an MP sparc64 machine with a -current DIAGNOSTIC kernel, preferably
> from an ssh login.
>
>>Fix:
> n/a
>
