NetBSD-Bugs archive


Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network interfaces in use



The following reply was made to PR port-amd64/53155; it has been noted by GNATS.

From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
To: "gnats-bugs%NetBSD.org@localhost" <gnats-bugs%netbsd.org@localhost>
Cc: port-amd64-maintainer%netbsd.org@localhost, gnats-admin%netbsd.org@localhost, 
	netbsd-bugs%netbsd.org@localhost, Havard Eidnes <he%netbsd.org@localhost>
Subject: Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network
 interfaces in use
Date: Thu, 17 May 2018 10:50:59 +0900

 On Wed, May 16, 2018 at 8:41 PM Havard Eidnes <he%netbsd.org@localhost> wrote:
 
 > The following reply was made to PR port-amd64/53155; it has been noted by GNATS.
 
 > From: Havard Eidnes <he%NetBSD.org@localhost>
 > To: gnats-bugs%NetBSD.org@localhost
 > Cc:
 > Subject: Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx
 >    network interfaces in use
 > Date: Wed, 16 May 2018 13:37:28 +0200 (CEST)
 
 >    [[ Hmm, here's a non-quoted-printable version of the previous ]]
 
 >    Hi,
 
 >    we provoked another wedge, and captured a kernel core dump from
 >    the wedging machine.  The kernel core dump and NetBSD images are
 >    available for looking at.
 
 
 >    Using gdb and crash, here's a brief summary of the "interesting"
 >    processes in the crash dump, and below that I include backtraces
 >    of all the waiting processes.
 
 >    There's lots of contention for fstrans_lock.
 
 >    Offhand I don't see a deadlock which might explain the observed
 >    behaviour (goes totally "deaf" on the network, i.e. doesn't even
 >    respond to ping).
 
 >    It doesn't look like gdb can trace through interrupt frames (?),
 >    looking at proc 788 all I get is:
 
 >    (gdb) kvm proc 0xfffffe8220b8e360
 >    0xffffffff8021cfe0 in softintr_ret ()
 >    (gdb) where
 >    #0  0xffffffff8021cfe0 in softintr_ret ()
 >    #1  0x0000000000000000 in ?? ()
 >    (gdb)
 
 >    Crash manages to do this one, though, apparently (see below).
 
 >    Furthermore, using the various gdb scripts in
 >    /usr/src/sys/gdbscripts/ I can look at some of the locks.
 
 >    It is quite possible that up'ing the interface in question triggers
 >    a lot of pty-pair-opening activity, and that the root cause of the
 >    issue lies there rather than in the networking code itself(?)
 
 >    Further hints?
 
 >    Regards,
 
 >    - Havard
 
 >    ------------------------------
 
 >    PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 >    7052     1 3   1   8020000   fffffe8220c58540               cron tstile
 >      Wants fstrans_lock
 
 >    9187     1 3   6   8020000   fffffe821faa60c0             expect xchicv
 >      Holds fstrans_lock, in pserialize_perform, waits on condition variable
 >        after doing xc_broadcast(XC_HIGHPRI, nullop)
 >      Doing (roughly) pty_grant_slave -> genfs_revoke -> vfs_suspend ->
 >        fstrans_setstate -> pserialize_perform -> xc_wait -> cv_wait
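 
 The two entries above boil down to the following shape (a hypothetical
 user-space sketch in pthreads, not kernel code; all names here are made
 up to mirror the roles in the listing): one thread takes a lock and then
 waits, on a separate lock/condvar pair, for an event that never arrives,
 while a second thread merely wants the first lock and is stuck behind it.
 
     #include <pthread.h>
     #include <stdio.h>
 
     static pthread_mutex_t fstrans_lock_m = PTHREAD_MUTEX_INITIALIZER;
     static pthread_mutex_t xc_lock_m = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  xc_cv_m = PTHREAD_COND_INITIALIZER;
     static int xcall_done;              /* never becomes true in this model */
 
     /* "cron" (PID 7052): merely wants the fstrans lock and blocks
      * behind it; the kernel shows that as a "tstile" wait. */
     static void *
     cron_model(void *arg)
     {
         (void)arg;
         pthread_mutex_lock(&fstrans_lock_m);
         printf("cron-model got the lock\n");    /* never reached */
         pthread_mutex_unlock(&fstrans_lock_m);
         return NULL;
     }
 
     int
     main(void)
     {
         pthread_t cron;
 
         /* "expect" (PID 9187): holds the fstrans lock ... */
         pthread_mutex_lock(&fstrans_lock_m);
 
         pthread_create(&cron, NULL, cron_model, NULL);
 
         /* ... and waits, on a different lock/condvar pair as xc_wait
          * does, for a completion that never comes; the fstrans lock
          * stays held, so the cron thread can never make progress. */
         pthread_mutex_lock(&xc_lock_m);
         while (!xcall_done)
             pthread_cond_wait(&xc_cv_m, &xc_lock_m);
         pthread_mutex_unlock(&xc_lock_m);
 
         pthread_mutex_unlock(&fstrans_lock_m);
         pthread_join(cron, NULL);
         return 0;
     }
 
 Note that pthread_cond_wait releases only the xcall-side mutex, so the
 fstrans-side mutex stays held for the whole wait, which is exactly the
 situation the listing describes.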
 
 This xcall requires that the SOFTINT_SERIAL softint (softser/N) on
 every CPU run the xcall's callback. If any of those softints gets
 stuck for some reason, the xcall never finishes.
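 
 To make that concrete, here is a minimal user-space model of the
 broadcast-and-wait pattern (pthreads; illustrative only, not the actual
 sys/kern/subr_xcall.c code, and the names below are invented): one
 callback is posted per "CPU" and the caller sleeps until every worker
 has run it, so a single stuck worker (worker 0 below, standing in for a
 wedged softser/0) leaves the waiter blocked forever.
 
     #include <pthread.h>
     #include <stdbool.h>
     #include <stdint.h>
     #include <stdio.h>
     #include <unistd.h>
 
     #define NWORKERS 4
 
     static pthread_mutex_t xc_lock = PTHREAD_MUTEX_INITIALIZER;
     static pthread_cond_t  xc_busy = PTHREAD_COND_INITIALIZER;
     static bool pending[NWORKERS];      /* callback posted to worker i */
     static int  remaining;              /* workers that have not run it yet */
 
     static void *
     worker(void *arg)
     {
         int id = (int)(intptr_t)arg;
 
         if (id == 0)                    /* model a wedged softser/0 */
             for (;;)
                 sleep(1);
 
         for (;;) {
             pthread_mutex_lock(&xc_lock);
             if (pending[id]) {
                 pending[id] = false;    /* "run" the callback (nullop) */
                 if (--remaining == 0)
                     pthread_cond_broadcast(&xc_busy);
             }
             pthread_mutex_unlock(&xc_lock);
             usleep(1000);
         }
     }
 
     int
     main(void)
     {
         pthread_t t[NWORKERS];
 
         for (int i = 0; i < NWORKERS; i++)
             pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
 
         /* "xc_broadcast(XC_HIGHPRI, nullop, ...)": post to every worker. */
         pthread_mutex_lock(&xc_lock);
         for (int i = 0; i < NWORKERS; i++)
             pending[i] = true;
         remaining = NWORKERS;
 
         /* "xc_wait": sleep until all workers have processed the callback;
          * worker 0 never does, so this wait never returns. */
         while (remaining > 0)
             pthread_cond_wait(&xc_busy, &xc_lock);
         pthread_mutex_unlock(&xc_lock);
 
         printf("all workers ran the callback\n");
         return 0;
     }
 
 Built with cc -pthread, this deliberately never prints and has to be
 killed; that is the shape of the hang suspected at the xc_wait above.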
 
 Could you show the stack trace of each softser/N? In particular,
 softser/0 appears to be running and is a suspect.
 
 Thanks,
       ozaki-r
 

