NetBSD-Bugs archive
Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network interfaces in use
The following reply was made to PR port-amd64/53155; it has been noted by GNATS.
From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
To: "gnats-bugs%NetBSD.org@localhost" <gnats-bugs%netbsd.org@localhost>
Cc: port-amd64-maintainer%netbsd.org@localhost, gnats-admin%netbsd.org@localhost,
netbsd-bugs%netbsd.org@localhost, Havard Eidnes <he%netbsd.org@localhost>
Subject: Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network interfaces in use
Date: Thu, 17 May 2018 10:50:59 +0900
On Wed, May 16, 2018 at 8:41 PM Havard Eidnes <he%netbsd.org@localhost> wrote:
> The following reply was made to PR port-amd64/53155; it has been noted by GNATS.
> From: Havard Eidnes <he%NetBSD.org@localhost>
> To: gnats-bugs%NetBSD.org@localhost
> Cc:
> Subject: Re: port-amd64/53155: OS wedges after <12h uptime when >2 bnx network interfaces in use
> Date: Wed, 16 May 2018 13:37:28 +0200 (CEST)
> [[ Hmm, here's a non-quoted-printable version of the previous ]]
> Hi,
> we provoked another wedge, and captured a kernel core dump from
> the wedging machine. The kernel core dump and NetBSD images are
> available for looking at.
> Using gdb and crash, here's a brief summary of the "interesting"
> processes in the crash dump, and below that I include backtraces
> of all the waiting processes.
> There's lots of contention for fstrans_lock.
> Offhand I don't see a deadlock which might explain the observed
> behaviour (goes totally "deaf" on the network, i.e. doesn't even
> respond to ping).
> It doesn't look like gdb can trace through interrupt frames (?),
> looking at proc 788 all I get is:
> (gdb) kvm proc 0xfffffe8220b8e360
> 0xffffffff8021cfe0 in softintr_ret ()
> (gdb) where
> #0 0xffffffff8021cfe0 in softintr_ret ()
> #1 0x0000000000000000 in ?? ()
> (gdb)
> Crash manages to do this one, though, apparently (see below).
> Furthermore, using the various gdb scripts in
> /usr/src/sys/gdbscripts/ I can look at some of the locks.
> It is quite possible that up'ing the interface in question causes
> lots of activity for opening pty pairs, and that the root cause
> of the issue lies there rather than in the networking code itself(?)
> Further hints?
> Regards,
> - Havard
> ------------------------------
> PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
> 7052 1 3 1 8020000 fffffe8220c58540 cron tstile
> Wants fstrans_lock
> 9187 1 3 6 8020000 fffffe821faa60c0 expect xchicv
> Holds fstrans_lock, in pserialize_perform, waits on condition variable
> after doing xc_broadcast(XC_HIGHPRI, nullop)
> Doing (roughly) pty_grant_slave -> genfs_revoke -> vfs_suspend ->
> fstrans_setstate -> pserialize_perform -> xc_wait -> cv_wait
This xcall requires that the SOFTINT_SERIAL softint (softser/N)
on every CPU process the xcall's callback. If any of those softints
gets stuck for some reason, the xcall never finishes.
Could you show the stack trace of each softser/N? In particular,
softser/0 appears to be running and is a suspect.
Thanks,
ozaki-r