NetBSD-Bugs archive
kern/55576: mpt clean volume degrades without clear reason, then panics
>Number: 55576
>Category: kern
>Synopsis: mpt volume degradation and panic
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Aug 14 18:55:00 +0000 2020
>Originator: S.P.Zeidler <spz%NetBSD.org@localhost>
>Release: NetBSD 9.0_STABLE 20200807
>Organization:
The NetBSD Foundation
>Environment:
System: NetBSD babylon5.netbsd.org 9.0_STABLE NetBSD 9.0_STABLE (BABYLON5) #1: Fri Aug 7 10:27:09 UTC 2020 spz%franklin.NetBSD.org@localhost:/home/netbsd/9/amd64/obj/sys/arch/amd64/compile/BABYLON5 amd64
Architecture: x86_64
Machine: amd64
>Description:
The panic is nothing new; it happens whenever there is a RAID to mend:
commands time out and the command pool dries up. 35071 is probably related,
but dates from 5 major versions back.
[ 271468.2819715] mpt0: Phy 0: Link Rate 3.0 Gbps
[ 271468.2919772] mpt0: Phy 1: Link Rate 3.0 Gbps
[ 271468.5121031] mpt0: Unknown async event: 0xb
[ 271468.5421203] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 271468.5521261] mpt0: Unknown async event: 0x13
[ 271468.7722520] mpt0: Unknown async event: 0xb
[ 271468.7722520] mpt0: Unknown async event: 0x15
[ 271468.7722520] mpt0: Unknown async event: 0x21
[ 271469.0023837] mpt0: Unknown async event: 0x15
[ 271469.0123894] mpt0: Unknown async event: 0x21
[ 271472.2742556] mpt0: Unknown async event: 0x21
[ 271472.2742556] mpt0: Unknown async event: 0x21
[ 271478.6679137] mpt0: restart succeeded
[ 271486.8225792] mpt0: read_cfg_header timed out
[ 271486.8325852] uvm_fault(0xffff888f9d870a38, 0x0, 1) -> e
[ 271486.8325852] fatal page fault in supervisor mode
[ 271486.8325852] trap type 6 code 0 rip 0xffffffff803bc2c7 cs 0x8 rflags 0x10246 cr2 0x10 ilevel 0x6 rsp 0xffffc186d3932bb0
[ 271486.8325852] curlwp 0xffff888f54660720 pid 21093.1 lowest kstack 0xffffc186d39302c0
kernel: page fault trap, code=0
Stopped in pid 21093.1 (bioctl) at netbsd:mpt_read_cfg_header+0x27:
movq 10(%rax),%rax
db{2}> bt
mpt_read_cfg_header() at netbsd:mpt_read_cfg_header+0x27
mpt_get_cfg_page_ioc2() at netbsd:mpt_get_cfg_page_ioc2+0x23
mpt_bio_ioctl() at netbsd:mpt_bio_ioctl+0x17d
bioioctl() at netbsd:bioioctl+0x289
VOP_IOCTL() at netbsd:VOP_IOCTL+0x54
vn_ioctl() at netbsd:vn_ioctl+0xa5
sys_ioctl() at netbsd:sys_ioctl+0x5ab
syscall() at netbsd:syscall+0x157
--- syscall (number 54) ---
7af49936824a:
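The fault address (cr2 0x10) and the faulting instruction, a load at offset 0x10 from %rax, are consistent with a NULL pointer being dereferenced near the start of mpt_read_cfg_header(). Since the command pool dries up, one possible reading is that a request allocation returned NULL and was used unchecked. That is an assumption; the trace alone does not prove it. The minimal C sketch below illustrates the kind of guard such a path would need. All names (sketch_request, sketch_pool, sketch_get_request, sketch_put_request, sketch_read_cfg_header) and the structure layout are hypothetical and are not the real mpt(4) code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the real mpt(4) structures. */
struct sketch_request {
	void *link[2];	/* placeholder linkage; puts buf at offset 0x10 on LP64 */
	void *buf;	/* message buffer the caller wants to read */
};

#define SKETCH_POOL_SIZE 4

struct sketch_pool {
	struct sketch_request *free[SKETCH_POOL_SIZE];
	size_t nfree;	/* number of valid entries in free[] */
};

/* Take a request from the pool; returns NULL when the pool is exhausted. */
static struct sketch_request *
sketch_get_request(struct sketch_pool *pool)
{
	if (pool->nfree == 0)
		return NULL;
	return pool->free[--pool->nfree];
}

/* Return a request to the pool (no-op if the pool is already full). */
static void
sketch_put_request(struct sketch_pool *pool, struct sketch_request *req)
{
	if (pool->nfree < SKETCH_POOL_SIZE)
		pool->free[pool->nfree++] = req;
}

/*
 * Copy len bytes of header data out of a pooled request.
 * Returns 0 on success, or EAGAIN if no request is available.
 * Without the NULL check, req->buf would load from address 0x10
 * on LP64, which is the kind of fault the trace above shows.
 */
static int
sketch_read_cfg_header(struct sketch_pool *pool, void *hdr, size_t len)
{
	struct sketch_request *req = sketch_get_request(pool);

	if (req == NULL)
		return EAGAIN;	/* pool dried up: fail instead of faulting */

	memcpy(hdr, req->buf, len);
	sketch_put_request(pool, req);
	return 0;
}
```

With an empty pool the guarded function returns EAGAIN rather than faulting, so a caller such as an ioctl path could report the error to userland instead of panicking.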
What's new:
the volume goes to Degraded without a clear reason (such as a bad disk).
Log:
# bioctl mpt0 show
Volume Status Size Device/Label Level Stripe
=============================================================
0 Online 1.8T sd0 LSILOGIC Logical Volume 3000 RAID 1 N/A
0:0 Online 1.8T 0:2.0 noencl <ATA ST2000NM0011 SN02>
0:1 Online 1.8T 0:1.0 noencl <ATA ST2000NM0011 SN02>
Machine has been up since: Wed Aug 12 20:31:14 UTC 2020
Starting after Fri Aug 14 16:00:00 2020, the console log says:
[ 213428.7823240] mpt0: mpt_done: IOC SCSI task terminated!
[ 213428.7823240] mpt0: mpt_done: IOC fatal error: restarting...
[ 213459.0088964] mpt0: soft reset failed: ack timeout
[ 213459.0088964] mpt0: soft reset failed
[the four messages above repeat in order more than 200 times, until]
[ 216179.3003868] mpt0: mpt_done: IOC SCSI task terminated!
[ 216179.3003868] mpt0: mpt_done: IOC fatal error: restarting...
[ 216179.4404636] mpt0: re-queued 254 requests
[ 216180.9312810] mpt0: Phy 0: Link Rate 3.0 Gbps
[ 216180.9312810] mpt0: Phy 1: Link Rate 3.0 Gbps
[ 216181.1413962] mpt0: Unknown async event: 0xb
[ 216181.1413962] mpt0: Unknown async event: 0x15
[ 216181.1413962] mpt0: Unknown async event: 0x21
[ 216181.1614072] mpt0: Unknown async event: 0xb
[ 216181.1814182] mpt0: Unknown async event: 0xb
[ 216181.1814182] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 216181.2014292] mpt0: Unknown async event: 0x13
[ 216181.4415608] mpt0: Unknown async event: 0xb
[ 216181.4415608] mpt0: Unknown async event: 0x15
[ 216181.4415608] mpt0: Unknown async event: 0x21
[ 216181.6616815] mpt0: Unknown async event: 0x15
[ 216181.6616815] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0x21
[ 216187.2047206] mpt0: Unknown async event: 0xb
[ 216191.3069698] mpt0: restart succeeded
[ 216202.1829332] mpt0: Phy 2: Link Status Unknown
[ 216205.1845792] mpt0: Unknown async event: 0xb
[ 216205.1845792] mpt0: Unknown async event: 0x15
[ 216205.1845792] mpt0: Unknown async event: 0x21
[ 216205.1945845] mpt0: Unknown async event: 0xb
[ 216205.2145956] mpt0: Unknown async event: 0x21
[ 216205.2246010] mpt0: Unknown async event: 0x21
[ 216205.2246010] mpt0: Unknown async event: 0xb
[ 216205.2246010] mpt0: Unknown async event: 0x15
[ 216205.2246010] mpt0: Unknown async event: 0x21
[ 216205.2346065] mpt0: Unknown async event: 0x15
[ 216205.2346065] mpt0: Unknown async event: 0x21
[ 216238.4328103] mpt0: Phy 2: Link Rate 3.0 Gbps
[ 216238.4928423] mpt0: Unknown async event: 0xb
[ 216238.4928423] mpt0: Unknown async event: 0x15
[ 216238.5028481] mpt0: Unknown async event: 0x21
[ 216238.8830568] mpt0: Unknown async event: 0x15
[ 216238.8830568] mpt0: Unknown async event: 0x21
[ 216238.9731065] mpt0: Unknown async event: 0x21
[ 216238.9731065] mpt0: Unknown async event: 0x21
[ 216239.0031229] mpt0: Unknown async event: 0xb
[ 216502.1974219] mpt0: Unknown async event: 0xb
[ 216502.1974219] mpt0: Unknown async event: 0x21
[ 217343.5687086] mpt0: Unknown async event: 0x14
[ 218445.7129594] mpt0: Unknown async event: 0x14
[ 219507.2649365] mpt0: Unknown async event: 0x14
[ 220563.8741977] mpt0: Unknown async event: 0x14
# bioctl mpt0 show
Volume Status Size Device/Label Level Stripe
=============================================================
0 Degraded 1.8T sd0 LSILOGIC Logical Volume 3000 RAID 1 N/A
0:0 Online 1.8T 0:2.0 noencl <ATA ST2000NM0011 SN02>
0:1 Online 1.8T 0:1.0 noencl <ATA ST2000NM0011 SN02>
Until Aug 4th the system ran an 8_STABLE kernel and occasionally had
3 or 4 mpt_done timeouts. With the netbsd-9 20200802 and 20200807
kernels, the situation has become significantly less stable.
>How-To-Repeat:
>Fix: