Nobody ? -10.0 seems to be unusable with iscsi initior (targets are QNAP NAS). System randomly panics, sometimes after few hours, sometimes after few weeks. It's not related to I/O (last panic occurs when both iSCSI targets are idle. QNAP firmware switchs off disks when there is not activity on target). During last panic, dmesg contains : [ 2069698,164027] S3C1: freeing UNUSED pdu [ 2090569,298079] fatal breakpoint trap in supervisor mode [ 2090569,298079] trap type 1 code 0 rip 0xffffffff80235385 cs 0x8 rflags 0x202 cr2 0x7f07e8b036f8 ilevel 0x6 rsp 0xffff9b84372a1b00 [ 2090569,298079] curlwp 0xffff87b59440a080 pid 0.5 lowest kstack 0xffff9b843729d2c0 S3C1: freeing UNUSED pdu [ 2421020,454350] S-1C-1: ccb_timeout: num=1 total=1 disp=0 [ 2421020,454350] uvm_fault(0xffffffff819755e0, 0x0, 2) -> e [ 2421020,454350] fatal page fault in supervisor mode [ 2421020,454350] trap type 6 code 0x2 rip 0xffffffff8023fa4c cs 0x8 rflags 0x10246 cr2 0x10 ilevel 0 rsp 0xffff9b8448074f68 [ 2421020,454350] curlwp 0xffff87b296e8f980 pid 0.325 lowest kstack 0xffff9b84480702c0 [ 2421020,454350] panic: trap [ 2421020,454350] cpu3: Begin traceback... [ 2421020,454350] vpanic() at netbsd:vpanic+0x183 [ 2421020,454350] panic() at netbsd:panic+0x3c [ 2421020,454350] trap() at netbsd:trap+0xbaf [ 2421020,454350] --- trap (number 6) --- [ 2421020,454350] mutex_enter() at netbsd:mutex_enter+0xc [ 2421020,454350] ccb_timeout() at netbsd:ccb_timeout+0xa6 [ 2421020,454350] iscsi_cleanup_thread() at netbsd:iscsi_cleanup_thread+0x333 [ 2421020,454350] cpu3: End traceback... [ 2421020,454350] dumping to dev 18,1 (offset=251919, size=4162814): [ 2421020,454350] dump [ 1.0000000] Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, [ 1,000000] 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, [ 1,000000] 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023 [ 1,000000] The NetBSD Foundation, Inc. All rights reserved. but no dump is written in /var/crash. swap is on a raidframe slice, maybe it's related. Please note that dmesg contains two iscsi errors (S3C1: freeing UNUSED pdu, why ?) and another one (S-1C-1: ccb_timeout). Maybe both errors are related, but I don't know if during last panic, kernel complains about "freeing UNUSED pdu". iSCSI initiator is connected to two targets (ffs V2): [ 85.069032] scsibus0 at iscsi0: 1 target, 16 luns per target [ 85.069032] sd0 at scsibus0 target 0 lun 0: <QNAP, iSCSI Storage, 4.0> disk fixed [ 85.089034] scsibus1 at iscsi0: 1 target, 16 luns per target [ 85.099032] sd0: fabricating a geometry [ 85.099032] sd0: 10980 GB, 11244416 cyl, 64 head, 32 sec, 512 bytes/sect x 23028563968 sectors [ 85.119032] sd0: fabricating a geometry [ 85.149033] sd0: GPT GUID: a5d27c7c-8eda-40e8-a29b-e85a539a5bc7 [ 85.149033] dk6 at sd0: "bacula", 23028563901 blocks at 34, type: ffs [ 85.149033] sd0: async, 8-bit transfers, tagged queueing [ 85.149033] sd1 at scsibus1 target 0 lun 0: <QNAP, iSCSI Storage, 4.0> disk fixed [ 85.169034] sd1: fabricating a geometry [ 85.169034] sd1: 10988 GB, 11251968 cyl, 64 head, 32 sec, 512 bytes/sect x 23044030464 sectors [ 85.199034] sd1: fabricating a geometry [ 85.219032] sd1: GPT GUID: 799b4d25-970c-4a32-a388-a59470280de0 [ 85.229034] dk7 at sd1: "video", 23044030397 blocks at 34, type: ffs [ 85.229034] sd1: async, 8-bit transfers, tagged queueing [ 86.789035] /opt/bacula: replaying log to disk [ 87.269034] /opt/video: replaying log to disk These targets are directly connected to this server on bridge0 ethernet interface: bridge0: flags=0x41<UP,RUNNING> mtu 9000 capabilities=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx> capabilities=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx> capabilities=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx> enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx> enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx> enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx> status: active and bridge0 is built with wm0 and wm1: wm0: flags=0x8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 9000 capabilities=0x7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx> capabilities=0x7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx> capabilities=0x7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6> enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx> enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx> enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx> ec_capabilities=0x17<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,EEE> ec_enabled=0x3<VLAN_MTU,VLAN_HWTAGGING> address: b4:96:91:92:77:6e media: Ethernet autoselect (1000baseT full-duplex,master) status: active inet6 fe80::b696:91ff:fe92:776e%wm0/64 flags 0 scopeid 0x1 inet 192.168.12.1/24 broadcast 192.168.12.255 flags 0 wm1: flags=0x8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 9000 capabilities=0x7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx> capabilities=0x7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx> capabilities=0x7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6> enabled=0x3ff00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx> enabled=0x3ff00<UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx> enabled=0x3ff00<UDP6CSUM_Rx,UDP6CSUM_Tx> ec_capabilities=0x17<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,EEE> ec_enabled=0x3<VLAN_MTU,VLAN_HWTAGGING> address: b4:96:91:92:77:6f media: Ethernet autoselect (1000baseT full-duplex) status: active inet6 fe80::b696:91ff:fe92:776f%wm1/64 flags 0 scopeid 0x2 Of course, both QNAP are configured with the same MTU. That's beeing said, kernel seems to panic at line 1823 of iscsi_ioctl.c that calls ccb_timeout(ccb_t *ccb). Kernel prints : S-1C-1: ccb_timeout: num=1 total=1 disp=0 Thus internal variables are : - ccb->ccb_num_timeouts = 0 - ccb->ccb_total_tries = 1 - ccb->ccb_disp = 0 - conn = ccb->ccb_connection = S-1C-1 ccb->ccb_disp <= CCBDISP_FREE is true as CCBDISP_FREE = 3 (iscsi_globals.h). handle_connection_error() is not reached, thus faulty function seems to be wake_ccb() or handle_connection_error(). Without crash dump, I only can debug with panic message. Address reached in ccb_timeout() is : legendre# addr2line -e /netbsd.gdb FFFFFFFF80E18CFE /usr/src/netbsd-10/src/sys/dev/iscsi/iscsi_send.c:1708 (handle_connection_error(conn, ISCSI_STATUS_TIMEOUT, RECOVER_CONNECTION);) but I cannot obtain the same information for mutex_enter() (and (handle_connection_error() seems to be compiled as inline function): legendre# addr2line -e /netbsd.gdb FFFFFFFF8023FA4C ??:? Help will be welcome. Of course, I have tried to have a crash dump but without success (savecore=yes in my rc.conf file). dumps Should be written on raid0b: [ 5,009093] boot device: raid0 [ 5,009093] root on raid0a dumps on raid0b If I remember, in the past, crash dumps were written on raid0b without any trouble. Best regards, JB
Attachment:
signature.asc
Description: OpenPGP digital signature