NetBSD-Bugs archive


Re: kern/52263: Frequent ixg(4) panic



Hi.

On 2017/05/31 1:05, Hauke Fath wrote:
Number:         52263
Category:       kern
Synopsis:       Frequent ixg(4) panic in ixgbe_rxeof()
Confidential:   no
Severity:       critical
Priority:       high
Responsible:    kern-bug-people
State:          open
Class:          sw-bug
Submitter-Id:   net
Arrival-Date:   Tue May 30 16:05:00 +0000 2017
Originator:     Hauke Fath
Release:        NetBSD 7.99.73
Organization:
Technische Universitaet Darmstadt
Environment:
	
	
System: NetBSD Zinnenwand 7.99.73 NetBSD 7.99.73 (FIFI-$Revision$) #0: Mon May 29 17:00:08 CEST 2017 hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI amd64
Architecture: x86_64
Machine: amd64
Description:

	A pf & carp router under current (7.99.73 here, but happens in
	yesterday's .75, too) panics frequently with

NetBSD 7.99.73 (FIFI-$Revision$) #2: Fri May 26 15:51:24 CEST 2017
         hf@Hochstuhl:/var/obj/netbsd-builds/developer/amd64/sys/arch/amd64/compile/FIFI

[...]

fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8029646d cs 0x8 rflags 0x10202 cr2 0xffff80008f799000 ilevel 0x8 rsp 0xfffffe810e8aeeb0
curlwp 0xfffffe810e89d4c0 pid 0.18 lowest kstack 0xfffffe810e8ab2c0
panic: trap
cpu1: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ixgbe_rxeof() at netbsd:ixgbe_rxeof+0x523
ixgbe_handle_que() at netbsd:ixgbe_handle_que+0x98
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e8aeff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
0:
cpu1: End traceback...
rebooting...

	According to objdump(1) probing, the relevant instruction is
	at sys/dev/pci/ixgbe/ix_txrx.c:1933

    1922                         /*
    1923                          * Optimize.  This might be a small packet,
    1924                          * maybe just a TCP ACK.  Do a fast copy that
    1925                          * is cache aligned into a new mbuf, and
    1926                          * leave the old mbuf+cluster for re-use.
    1927                          */
    1928                         if (eop && len <= IXGBE_RX_COPY_LEN) {
    1929                                 sendmp = m_gethdr(M_NOWAIT, MT_DATA);
    1930                                 if (sendmp != NULL) {
    1931                                         sendmp->m_data +=
    1932                                             IXGBE_RX_COPY_ALIGN;
    1933                                         ixgbe_bcopy(mp->m_data,
    1934                                             sendmp->m_data, len);
    1935                                         sendmp->m_len = len;
    1936                                         rxr->rx_copies.ev_count++;
    1937                                         rbuf->flags |= IXGBE_RX_COPY;
    1938                                 }
    1939                         }

	I tried to KASSERT() for zero pointers, but it wasn't that
	easy.

	Sometimes I also see

fatal protection fault in supervisor mode
trap type 4 code 0 rip 0xffffffff8061e443 cs 0x8 rflags 0x10202 cr2 0x6b1e00 ilevel 0x4 rsp 0xfffffe810e913ef0
curlwp 0xfffffe810e904540 pid 0.30 lowest kstack 0xfffffe810e9102c0
panic: trap
cpu3: Begin traceback...
vpanic() at netbsd:vpanic+0x140
snprintf() at netbsd:snprintf
trap() at netbsd:trap+0xbab
--- trap (number 4) ---
ether_input() at netbsd:ether_input+0x83
if_percpuq_softint() at netbsd:if_percpuq_softint+0x5b
softint_dispatch() at netbsd:softint_dispatch+0xd4
DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfffffe810e913ff0
Xsoftintr() at netbsd:Xsoftintr+0x4f
--- interrupt ---
f557b81a7cde3fa1:
cpu3: End traceback...
rebooting...


How-To-Repeat:

	Run serious amounts of traffic over an ixg(4) equipped pf/carp
	router machine - 9 vlans here.

 Does this problem still occur?

I suspect this is not ixg(4)'s bug but pf's bug.
Have you ever tested without pf?

The following change avoids using the optimization, but
it won't solve your machine's problem.

------------------
Index: ix_txrx.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/ixgbe/ix_txrx.c,v
retrieving revision 1.27
diff -u -p -r1.27 ix_txrx.c
--- ix_txrx.c	13 Jun 2017 09:37:22 -0000	1.27
+++ ix_txrx.c	10 Aug 2017 04:40:59 -0000
@@ -1915,6 +1915,7 @@ ixgbe_rxeof(struct ix_queue *que)
 			 * is cache aligned into a new mbuf, and
 			 * leave the old mbuf+cluster for re-use.
 			 */
+#if 0
 			if (eop && len <= IXGBE_RX_COPY_LEN) {
 				sendmp = m_gethdr(M_NOWAIT, MT_DATA);
 				if (sendmp != NULL) {
@@ -1927,6 +1928,7 @@ ixgbe_rxeof(struct ix_queue *que)
 					rbuf->flags |= IXGBE_RX_COPY;
 				}
 			}
+#endif
 			if (sendmp == NULL) {
 				rbuf->buf = rbuf->fmp = NULL;
 				sendmp = mp;
------------------
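
For readers who don't know the driver, here is a minimal user-space
sketch of what the optimization does and what the change above turns
off. It is not the driver code. struct pkt, rx_select(), RX_COPY_LEN and
RX_COPY_ENABLE are hypothetical stand-ins for the mbuf handling in
ixgbe_rxeof(), and the threshold value is arbitrary. With
RX_COPY_ENABLE set to 0 the function always takes the fallback path and
hands the original buffer up, which is the effect of the "#if 0".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary threshold for this sketch; NOT the real IXGBE_RX_COPY_LEN. */
#define RX_COPY_LEN 128

/* Set to 0 to mimic the "#if 0" in the diff above. */
#ifndef RX_COPY_ENABLE
#define RX_COPY_ENABLE 1
#endif

/* Hypothetical stand-in for an mbuf: a data pointer and a length. */
struct pkt {
	unsigned char *data;
	size_t len;
};

/*
 * Pick the buffer to hand to the upper layer.
 * If the frame is complete (eop) and small, copy it into a fresh buffer
 * so the ring buffer can be reused; *copied is set to 1 in that case.
 * Otherwise, or if allocation fails, hand up the ring buffer itself.
 */
static struct pkt *
rx_select(struct pkt *ring_buf, int eop, int *copied)
{
	struct pkt *send = NULL;

	assert(ring_buf != NULL && copied != NULL);
	*copied = 0;
#if RX_COPY_ENABLE
	if (eop && ring_buf->len <= RX_COPY_LEN) {
		send = malloc(sizeof(*send));
		if (send != NULL) {
			/* Avoid a zero-size allocation for empty frames. */
			send->data = malloc(ring_buf->len ? ring_buf->len : 1);
			if (send->data == NULL) {
				free(send);
				send = NULL;
			} else {
				memcpy(send->data, ring_buf->data, ring_buf->len);
				send->len = ring_buf->len;
				*copied = 1;
			}
		}
	}
#endif
	if (send == NULL)
		send = ring_buf;	/* fallback: pass the original up */
	return send;
}
```

In this model the caller would keep ring_buf on the ring only when
*copied is 1. Building with -DRX_COPY_ENABLE=0 removes the copy path
entirely.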


	Happens once every few hours here, so I can provide details,
	and/or try things easily.
	
	
Fix:
	I'd love to.

	

Unformatted:
  	
  	



--
-----------------------------------------------
                SAITOH Masanobu (msaitoh%execsw.org@localhost
                                 msaitoh%netbsd.org@localhost)

