Re: [10.0_STABLE] Hard lock

To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Subject: Re: [10.0_STABLE] Hard lock
From: BERTRAND Joël <joel.bertrand%systella.fr@localhost>
Date: Sat, 5 Oct 2024 12:17:47 +0200

Taylor R Campbell wrote:
>> Date: Fri, 4 Oct 2024 10:37:24 +0200
>> From: BERTRAND Joël <joel.bertrand%systella.fr@localhost>
>>
>> -tco*   at tcoichbus?           # TCO watch dog timer
>> +tco*   at ichlpcib?            # TCO watch dog timer
> 
> This is a curious change to make; what prompted it?  Are you using the
> watchdog timer?  I'm slightly surprised this builds at all, and I'm
> not sure it will work.

	I don't remember making this change... My config file was
written a long time ago.

	For your information, the server crashed last night.

>> 	I upgraded my tree maybe 10 days ago. Before this upgrade, the system
>> was stable (uptime greater than 120 days).
> 
> When was your tree previously updated?  This might help to narrow down
> which change might have introduced the problem.  (And, if you can
> bisect, that would be even more helpful!)

	The previous kernel had an uptime greater than 100 days. I then rebooted
with an up-to-date -10.0 kernel. Thus, I think the faulty patch was
introduced after May 2024.

>> 	I've just rebuilt a new kernel. I don't know if anyone uses a system
>> with a similar configuration (I suspect a bad interaction between ccd
>> and iscsi). But how can I find more information to debug this?

	I have rebuilt a kernel (same tree) with all diagnostic options enabled.
It panics in the iSCSI routines when iscsictl tries to connect to the first
iSCSI volume.

[    74.238270] panic: mutex_vector_enter,517: uninitialized lock
(lock=0xffff938021d86010, from=ffffffff80f71234)
[    74.238270] cpu1: Begin traceback...
[    74.238270] vpanic() at netbsd:vpanic+0x183
[    74.238270] panic() at netbsd:panic+0x3c
[    74.238270] lockdebug_wantlock() at netbsd:lockdebug_wantlock+0x180
[    74.248268] mutex_enter() at netbsd:mutex_enter+0x23f
[    74.248268] send_pdu() at netbsd:send_pdu+0x1b5
[    74.248268] send_logout() at netbsd:send_logout+0x1d4
[    74.248268] kill_connection() at netbsd:kill_connection+0x2fa
[    74.248268] kill_session() at netbsd:kill_session+0x134
[    74.248268] iscsiioctl() at netbsd:iscsiioctl+0x30f
[    74.248268] sys_ioctl() at netbsd:sys_ioctl+0x56d
[    74.248268] syscall() at netbsd:syscall+0x196
[    74.248268] --- syscall (number 54) ---
[    74.248268] netbsd:syscall+0x196:
[    74.248268] cpu1: End traceback...

You can download the faulty kernel (with and without debug options) from
ftp://newton.systella.fr (files NETBSD.;1 and NETBSD.GDB;1).

Please note that this server runs OpenVMS; use binary transfer mode.

> You could try a current kernel.  If the problem is there in current,
> it may be detected -- and reported in a more obvious way -- by the new
> heartbeat(9) diagnostic where each CPU's progress is periodically
> checked on by some other CPU.
	I will try.

	Lastly, please note that I cannot reboot my server with shutdown -r now
unless I have first killed altqd (with kill -9). For me, it's not a real
issue as this server is two floors below my office, but it could be one for
users whose server is far away...

	Best regards,

	JB
