Re: [iscsi] Panic in ccb_timeout

To: NetBSD User Maillist <netbsd-users%NetBSD.org@localhost>
Subject: Re: [iscsi] Panic in ccb_timeout
From: BERTRAND Joël <joel.bertrand%systella.fr@localhost>
Date: Thu, 16 Nov 2023 11:20:05 +0100

	Nobody ?

	-10.0 seems to be unusable with iscsi initior (targets are QNAP NAS).
System randomly panics, sometimes after few hours, sometimes after few
weeks. It's not related to I/O (last panic occurs when both iSCSI
targets are idle. QNAP firmware switchs off disks when there is not
activity on target). During last panic, dmesg contains :

[ 2069698,164027] S3C1: freeing UNUSED pdu
[ 2090569,298079] fatal breakpoint trap in supervisor mode
[ 2090569,298079] trap type 1 code 0 rip 0xffffffff80235385 cs 0x8
rflags 0x202 cr2 0x7f07e8b036f8 ilevel 0x6 rsp 0xffff9b84372a1b00
[ 2090569,298079] curlwp 0xffff87b59440a080 pid 0.5 lowest kstack
0xffff9b843729d2c0
S3C1: freeing UNUSED pdu
[ 2421020,454350] S-1C-1: ccb_timeout: num=1 total=1 disp=0
[ 2421020,454350] uvm_fault(0xffffffff819755e0, 0x0, 2) -> e
[ 2421020,454350] fatal page fault in supervisor mode
[ 2421020,454350] trap type 6 code 0x2 rip 0xffffffff8023fa4c cs 0x8
rflags 0x10246 cr2 0x10 ilevel 0 rsp 0xffff9b8448074f68
[ 2421020,454350] curlwp 0xffff87b296e8f980 pid 0.325 lowest kstack
0xffff9b84480702c0
[ 2421020,454350] panic: trap
[ 2421020,454350] cpu3: Begin traceback...
[ 2421020,454350] vpanic() at netbsd:vpanic+0x183
[ 2421020,454350] panic() at netbsd:panic+0x3c
[ 2421020,454350] trap() at netbsd:trap+0xbaf
[ 2421020,454350] --- trap (number 6) ---
[ 2421020,454350] mutex_enter() at netbsd:mutex_enter+0xc
[ 2421020,454350] ccb_timeout() at netbsd:ccb_timeout+0xa6
[ 2421020,454350] iscsi_cleanup_thread() at
netbsd:iscsi_cleanup_thread+0x333
[ 2421020,454350] cpu3: End traceback...
[ 2421020,454350] dumping to dev 18,1 (offset=251919, size=4162814):
[ 2421020,454350] dump
[   1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
[     1,000000]     2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013,
[     1,000000]     2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021,
2022, 2023
[     1,000000]     The NetBSD Foundation, Inc.  All rights reserved.

but no dump is written in /var/crash. swap is on a raidframe slice,
maybe it's related. Please note that dmesg contains two iscsi errors
(S3C1: freeing UNUSED pdu, why ?) and another one (S-1C-1: ccb_timeout).
Maybe both errors are related, but I don't know if during last panic,
kernel complains about "freeing UNUSED pdu".

	iSCSI initiator is connected to two targets (ffs V2):

[    85.069032] scsibus0 at iscsi0: 1 target, 16 luns per target
[    85.069032] sd0 at scsibus0 target 0 lun 0: <QNAP, iSCSI Storage,
4.0> disk fixed
[    85.089034] scsibus1 at iscsi0: 1 target, 16 luns per target
[    85.099032] sd0: fabricating a geometry
[    85.099032] sd0: 10980 GB, 11244416 cyl, 64 head, 32 sec, 512
bytes/sect x 23028563968 sectors
[    85.119032] sd0: fabricating a geometry
[    85.149033] sd0: GPT GUID: a5d27c7c-8eda-40e8-a29b-e85a539a5bc7
[    85.149033] dk6 at sd0: "bacula", 23028563901 blocks at 34, type: ffs
[    85.149033] sd0: async, 8-bit transfers, tagged queueing
[    85.149033] sd1 at scsibus1 target 0 lun 0: <QNAP, iSCSI Storage,
4.0> disk fixed
[    85.169034] sd1: fabricating a geometry
[    85.169034] sd1: 10988 GB, 11251968 cyl, 64 head, 32 sec, 512
bytes/sect x 23044030464 sectors
[    85.199034] sd1: fabricating a geometry
[    85.219032] sd1: GPT GUID: 799b4d25-970c-4a32-a388-a59470280de0
[    85.229034] dk7 at sd1: "video", 23044030397 blocks at 34, type: ffs
[    85.229034] sd1: async, 8-bit transfers, tagged queueing
[    86.789035] /opt/bacula: replaying log to disk
[    87.269034] /opt/video: replaying log to disk

	These targets are directly connected to this server on bridge0 ethernet
interface:

bridge0: flags=0x41<UP,RUNNING> mtu 9000
   capabilities=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx>
   capabilities=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx>
        capabilities=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx>
        enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx>
        enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx>
        enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx>
        status: active

and bridge0 is built with wm0 and wm1:

wm0:
flags=0x8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST>
mtu 9000
  capabilities=0x7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx>
  capabilities=0x7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx>
        capabilities=0x7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
        enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx>
        enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx>
        enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx>
        ec_capabilities=0x17<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,EEE>
        ec_enabled=0x3<VLAN_MTU,VLAN_HWTAGGING>
        address: b4:96:91:92:77:6e
        media: Ethernet autoselect (1000baseT full-duplex,master)
        status: active
        inet6 fe80::b696:91ff:fe92:776e%wm0/64 flags 0 scopeid 0x1
        inet 192.168.12.1/24 broadcast 192.168.12.255 flags 0

wm1:
flags=0x8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST>
mtu 9000
  capabilities=0x7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx>
  capabilities=0x7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx>
        capabilities=0x7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
        enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx>
        enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx>
        enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx>
        ec_capabilities=0x17<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,EEE>
        ec_enabled=0x3<VLAN_MTU,VLAN_HWTAGGING>
        address: b4:96:91:92:77:6f
        media: Ethernet autoselect (1000baseT full-duplex)
        status: active
        inet6 fe80::b696:91ff:fe92:776f%wm1/64 flags 0 scopeid 0x2

	Of course, both QNAP are configured with the same MTU.

	That's beeing said, kernel seems to panic at line 1823 of iscsi_ioctl.c
that calls ccb_timeout(ccb_t *ccb).

	Kernel prints :
S-1C-1: ccb_timeout: num=1 total=1 disp=0

	Thus internal variables are :
- ccb->ccb_num_timeouts = 0
- ccb->ccb_total_tries = 1
- ccb->ccb_disp = 0
- conn = ccb->ccb_connection = S-1C-1

ccb->ccb_disp <= CCBDISP_FREE is true as CCBDISP_FREE = 3
(iscsi_globals.h). handle_connection_error() is not reached, thus faulty
function seems to be wake_ccb() or handle_connection_error().

	Without crash dump, I only can debug with panic message.
Address reached in ccb_timeout() is :

legendre# addr2line -e /netbsd.gdb FFFFFFFF80E18CFE
/usr/src/netbsd-10/src/sys/dev/iscsi/iscsi_send.c:1708
(handle_connection_error(conn, ISCSI_STATUS_TIMEOUT, RECOVER_CONNECTION);)

but I cannot obtain the same information for mutex_enter() (and
(handle_connection_error() seems to be compiled as inline function):

legendre# addr2line -e /netbsd.gdb FFFFFFFF8023FA4C
??:?

	Help will be welcome. Of course, I have tried to have a crash dump but
without success (savecore=yes in my rc.conf file). dumps Should be
written on raid0b:

[     5,009093] boot device: raid0
[     5,009093] root on raid0a dumps on raid0b

	If I remember, in the past, crash dumps were written on raid0b without
any trouble.

	Best regards,

	JB

Attachment: signature.asc
Description: OpenPGP digital signature

References:
- [iscsi] Panic in ccb_timeout
  - From: BERTRAND Joël

Prev by Date: Re: NetBSD as an NTP stratum 1 server
Next by Date: Re: NetBSD as an NTP stratum 1 server
Previous by Thread: [iscsi] Panic in ccb_timeout
Next by Thread: "Loss of acceleration on rotation" PR in pre-release tasks, any possibility of making it to 10?
Indexes:

Home | Main Index | Thread Index | Old Index