Hello,

I'm using NetBSD 10.0 (stable) as a rolling release on my main server. This server runs a customized kernel (amd64). Here is the diff between GENERIC and my kernel:

legendre# diff -u GENERIC CUSTOM
...
 powernow0 at cpu0 # AMD PowerNow! and Cool'n'Quiet (non-ACPI)
-viac7temp* at cpu? # VIA C7, Nano and Zhaoxin temperature sensor
 vmt0 at cpu0 # VMware Tools
...
-#options ALTQ # Manipulate network interfaces' output queues
-#options ALTQ_BLUE # Stochastic Fair Blue
-#options ALTQ_CBQ # Class-Based Queueing
-#options ALTQ_CDNR # Diffserv Traffic Conditioner
-#options ALTQ_FIFOQ # First-In First-Out Queue
-#options ALTQ_FLOWVALVE # RED/flow-valve (red-penalty-box)
-#options ALTQ_HFSC # Hierarchical Fair Service Curve
-#options ALTQ_LOCALQ # Local queueing discipline
-#options ALTQ_PRIQ # Priority Queueing
-#options ALTQ_RED # Random Early Detection
-#options ALTQ_RIO # RED with IN/OUT
-#options ALTQ_WFQ # Weighted Fair Queueing
+options ALTQ # Manipulate network interfaces' output queues
+options ALTQ_BLUE # Stochastic Fair Blue
+options ALTQ_CBQ # Class-Based Queueing
+options ALTQ_CDNR # Diffserv Traffic Conditioner
+options ALTQ_FIFOQ # First-In First-Out Queue
+options ALTQ_FLOWVALVE # RED/flow-valve (red-penalty-box)
+options ALTQ_HFSC # Hierarchical Fair Service Curve
+options ALTQ_LOCALQ # Local queueing discipline
+options ALTQ_PRIQ # Priority Queueing
+options ALTQ_RED # Random Early Detection
+options ALTQ_RIO # RED with IN/OUT
+options ALTQ_WFQ # Weighted Fair Queueing
-tco* at tcoichbus? # TCO watch dog timer
+tco* at ichlpcib? # TCO watch dog timer
+pseudo-device iscsi

I upgraded my tree about 10 days ago. Before this upgrade, the system was stable (uptime greater than 120 days). Since the last upgrade, the kernel has crashed twice: no panic, no output on the serial line, no information in the log files or on screen, and the physical keyboard doesn't send characters (and of course ctrl+esc doesn't open the internal debugger). The kernel is simply hard locked.

DIAGNOSTIC doesn't return usable information, and DEBUG and LOCKDEBUG cannot be used because these options add a lot of expensive tests (and the server becomes unusable). I have tried to bisect, without success.

Network configuration:
- bridge0 (wm0 + wm1) => connection to two NAS (iscsi kernel initiator)
- wm2 => WAN
- lagg0 (wm3 + wm4) => LAN
- re0 => DMZ

altqd runs on this server (I cannot stop altqd with /etc/rc.d/altqd stop, I have to do a kill -9; I filed a PR a long time ago).

Discs:
- wd0 + wd1 => ccd0 that contains swap devices for clients (wedges are exported through istgt)
- wd2 + wd3 => raidframe (raid 1) for system and local swap
- wd4 + wd5 + wd5 => raidframe (raid 5) for home
- sd0 => iscsi target (NAS 1)
- sd1 => iscsi target (NAS 2)

All partitions are formatted with FFSv2 with extended attributes and the log option. When the system locks up, I have to reboot in single user mode to manually run fsck -fpP. Without this command, when the system tries to mount the disc devices, the logs are replayed but the partitions remain in a dirty state (and the kernel will panic sooner or later).

I haven't written any particular configuration in /etc. Only nfsd runs with 512 connections (/usr/sbin/nfsd -n 512), but the system hangs in the middle of the night, when the clients and NAS are idle.

I've just rebuilt a new kernel. I don't know if anyone uses a system with a similar configuration (I suspect a bad interaction between ccd and iscsi). But how can I find more information to debug this?

Best regards,

JB