NetBSD-Bugs archive


Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network interfaces in use



On Wed, May 16, 2018 at 8:41 PM Havard Eidnes <he%netbsd.org@localhost> wrote:

> The following reply was made to PR port-amd64/53155; it has been noted by
> GNATS.

> From: Havard Eidnes <he%NetBSD.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc:
> Subject: Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx
>    network interfaces in use
> Date: Wed, 16 May 2018 13:37:28 +0200 (CEST)

>    [[ Hmm, here's a non-quoted-printable version of the previous ]]

>    Hi,

>    we provoked another wedge, and captured a kernel core dump from
>    the wedging machine.  The kernel core dump and NetBSD images are
>    available for looking at.


>    using gdb and crash, here's a brief summary of the "interesting"
>    processes in the crash dump, and below that I include backtraces
>    of all the waiting processes.

>    There's lots of contention for fstrans_lock.

>    Offhand I don't see a deadlock which might explain the observed
>    behaviour (goes totally "deaf" on the network, i.e. doesn't even
>    respond to ping).

>    It doesn't look like gdb can trace through interrupt frames (?),
>    looking at proc 788 all I get is:

>    (gdb) kvm proc 0xfffffe8220b8e360
>    0xffffffff8021cfe0 in softintr_ret ()
>    (gdb) where
>    #0  0xffffffff8021cfe0 in softintr_ret ()
>    #1  0x0000000000000000 in ?? ()
>    (gdb)

>    Crash manages to do this one, though, apparently (see below).

>    Furthermore, using the various gdb scripts in
>    /usr/src/sys/gdbscripts/ I can look at some of the locks.

>    It is quite possible that up'ing the interface in question causes
>    lots of activity for opening pty pairs, and that the root cause
>    of the issue is there rather than related to networking in itself(?)

>    Further hints?

>    Regards,

>    - Havard

>    ------------------------------

>    PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
>    7052     1 3   1   8020000   fffffe8220c58540               cron tstile
>      Wants fstrans_lock

>    9187     1 3   6   8020000   fffffe821faa60c0             expect xchicv
>      Holds fstrans_lock, in pserialize_perform, waits on condition variable
>        after doing xc_broadcast(XC_HIGHPRI, nullop)
>      Doing (roughly) pty_grant_slave -> genfs_revoke -> vfs_suspend ->
>        fstrans_setstate -> pserialize_perform -> xc_wait -> cv_wait

This xcall requires the SOFTINT_SERIAL softint (softser/N) on every
CPU to run the xcall's callback. If any of those softints gets stuck
for some reason, the xcall never finishes.
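
To illustrate the failure mode, here is a rough userland model in
Python, not kernel code. The function name xc_broadcast_and_wait, the
NCPU value and the timeout are all made up for this sketch; the real
wait in the kernel has no timeout, which is why a single stuck softint
can block the caller forever.

```python
import threading

NCPU = 4  # hypothetical CPU count for this model


def xc_broadcast_and_wait(stuck_cpus=(), timeout=0.5):
    """Model a broadcast cross-call followed by a wait for completion.

    Returns True if the handler on every CPU ran the callback, and
    False if the wait timed out. The timeout exists only so that this
    demo terminates; it is an assumption of the model.
    """
    done = 0
    cond = threading.Condition()
    release = threading.Event()  # used only to clean up "stuck" handlers

    def softint_handler(cpu):
        nonlocal done
        if cpu in stuck_cpus:
            # This handler never reaches the callback.
            release.wait()
            return
        with cond:
            done += 1  # the callback (a nullop here) ran on this CPU
            cond.notify_all()

    threads = [
        threading.Thread(target=softint_handler, args=(cpu,))
        for cpu in range(NCPU)
    ]
    for t in threads:
        t.start()

    # The caller sleeps until all CPUs have reported completion.
    with cond:
        finished = cond.wait_for(lambda: done == NCPU, timeout=timeout)

    release.set()  # let the stuck handlers exit so the demo can finish
    for t in threads:
        t.join()
    return finished
```

With no stuck handler the call returns True; with
stuck_cpus={0} the completion count stops at NCPU - 1 and the caller
only gets out because of the artificial timeout.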

Could you show the stack trace of each softser/N? In particular, softser/0
appears to be running and is a suspect.

Thanks,
      ozaki-r

