tech-kern archive


Re: NetBSD 9.2 (STABLE) kernel hangs without panic or ddb



On 2021/12/07 7:41, BERTRAND Joël wrote:
Hello,

My main server runs NetBSD 9.2 (STABLE). It has 16 GB of RAM, an i7 CPU (i7-4770), and a lot of disks: wd0/wd1 configured as ccd (swap for a diskless workstation), wd2/wd3 as RAID level 1 (system), wd4/wd5/wd6 as RAID level 5 (home), plus an external NAS (iSCSI target, bacula archives). It runs a customized kernel because I have configured ALTQ. I have noticed that I cannot stop altqd or reload its configuration without trouble: when altqd stops or restarts, it always takes 100% of a CPU...

The Ethernet link used to access the NAS is configured on both sides with MTU=9000 (wm0):

wm0: flags=0x8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 9000
         capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx>
         capabilities=7ff80<TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx>
         capabilities=7ff80<TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
         enabled=0
         ec_capabilities=17<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,EEE>
         ec_enabled=2<VLAN_HWTAGGING>
         address: b4:96:91:92:77:6e
         media: Ethernet autoselect (1000baseT full-duplex)
         status: active
         inet 192.168.12.1/24 broadcast 192.168.12.255 flags 0x0
         inet6 fe80::b696:91ff:fe92:776e%wm0/64 flags 0x0 scopeid 0x1

I don't know when I first saw this issue, but as far as I remember, 9.0 ran fine. Maybe 9.1 also. With 9.2, when bacula starts its monthly archive (around 50 files, 50 GB each, on an ffs2 filesystem), the kernel can crash at random. Since yesterday, I have seen two crashes.

The system doesn't respond anymore; the kernel doesn't enter ddb even though ddb is set in sysctl, and the magic request doesn't do anything. The kernel doesn't panic. I have tried to access the serial console: no answer. No dump file. The system just stops, maybe on a mutex or a spinlock.

If I unmount the iSCSI target, the system seems to be stable. I don't know if this issue is related to the iSCSI initiator or to an interaction between iSCSI and, maybe, the ccd driver... And I cannot remove functions from this server to test.

I have to fix this issue, but how? I don't have any usable information...

I think it would be worth trying LOCKDEBUG.
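
For example, a minimal sketch of a kernel config with LOCKDEBUG enabled. This assumes an amd64 machine and a config derived from GENERIC; the file name MYKERNEL is hypothetical, and the existing custom config (with its ALTQ options) would normally be edited instead:

```
# sys/arch/amd64/conf/MYKERNEL (hypothetical name, for illustration only)
# Assumption: the custom kernel is based on GENERIC; keep the existing
# ALTQ options from the current custom config alongside this line.
include "arch/amd64/conf/GENERIC"

options 	LOCKDEBUG	# expensive locking checks (commented out in GENERIC)
```

A LOCKDEBUG kernel is noticeably slower, so it is meant for reproducing the hang, not for permanent use. If the hang really is on a mutex or a spinlock, the extra checks may turn the silent stop into a panic or a ddb entry with some usable information.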

Best regards,

JB


--
-----------------------------------------------
                SAITOH Masanobu (msaitoh%execsw.org@localhost
                                 msaitoh%netbsd.org@localhost)

