kern/56978: nvme hangs under very heavy loads
>Number: 56978
>Category: kern
>Synopsis: nvme hangs under very heavy loads
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Aug 24 14:25:00 +0000 2022
>Originator: Paul Goyette
>Release: NetBSD 9.99.99
>Organization:
+--------------------+--------------------------+----------------------+
| Paul Goyette | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | FA29 0E3B 35AF E8AE 6651 | paul%whooppee.com@localhost |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette%netbsd.org@localhost |
| & Network Engineer | | pgoyette99%gmail.com@localhost |
+--------------------+--------------------------+----------------------+
>Environment:
System: NetBSD speedy.whooppee.com 9.99.99 NetBSD 9.99.99 (SPEEDY 2022-08-22 19:31:52 UTC) #0: Tue Aug 23 07:05:11 UTC 2022 paul%speedy.whooppee.com@localhost:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/SPEEDY amd64
Architecture: x86_64
Machine: amd64
>Description:
Under very high loads, the nvme driver seems to hang waiting
for an i/o completion that never happens (or is somehow not
seen). Symptoms are zero or one process waiting for i/o
completion (wchan = biolock), several more processes waiting
on wchan = biowait, and a generally large number of procs
hanging in tstile.
Debugging has shown that some number of nvme queues exhibit
large gaps between the queue-head and queue-tail indexes.
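For reference, the ``gap'' here is the usual circular-queue
occupancy: commands submitted by advancing the tail but not yet
reaped by advancing the head. The sketch below is illustrative
only; the function and field names are assumptions, not the
nvme(4) driver's actual ones.

#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative sketch only -- names are hypothetical, not the
 * driver's fields.  For a circular queue of "entries" slots,
 * this is how many commands have been submitted (tail advanced)
 * but not yet completed/reaped (head advanced).  A large value
 * that never shrinks, with no further completion interrupts,
 * would match the hang described above.
 */
static uint32_t
queue_gap(uint32_t head, uint32_t tail, uint32_t entries)
{
	return (tail + entries - head) % entries;
}

int
main(void)
{
	/* e.g. head stuck at 5 while tail has advanced to 900 of 1024 */
	printf("outstanding commands: %u\n", queue_gap(5, 900, 1024));
	return 0;
}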
For me, this is easily reproducible by running three copies
of ``build.sh release'', each in its own tree, with all output
files (in obj, destdir, tools, and release) directed to the
same nvme. All source directories are on the same nvme and
are null-mounted read-only on top of the read-write
directories.
Once the hang occurs, the system is still usable, as long as
you don't touch the busy nvme device. I've been able to
reproduce this on both GENERIC and custom kernel configs, and
have successfully obtained crash dumps and run gdb(1) on the
running kernel.
Here are the portions of dmesg related to the nvmes (manual
line-breaks inserted for readability). The troublesome nvme
is nvme1.
...
[ 1.020867] nvme0 at pci2 dev 0 function 0:
vendor 144d product a804 (rev. 0x00)
[ 1.020867] nvme0: NVMe 1.2
...
[ 1.020867] ld0 at nvme0 nsid 1
[ 1.020867] ld0: 476 GB, 62260 cyl, 255 head,
63 sec, 512 bytes/sect x 1000215216 sectors
...
[ 1.020867] nvme1 at pci5 dev 0 function 0:
vendor 144d product a808 (rev. 0x00)
[ 1.020867] nvme1: NVMe 1.3
...
[ 1.020867] ld1 at nvme1 nsid 1
[ 1.020867] ld1: 1863 GB, 243201 cyl, 255 head,
63 sec, 512 bytes/sect x 3907029168 sectors
...
[ 1.019791] nvme2 at pci6 dev 0 function 0:
vendor 144d product a80a (rev. 0x00)
[ 1.019791] nvme2: NVMe 1.3
...
[ 1.019791] ld2 at nvme2 nsid 1
[ 1.019791] ld2: 1863 GB, 243201 cyl, 255 head,
63 sec, 512 bytes/sect x 3907029168 sectors
...
nvme0 is 512GB Samsung 960 PRO
nvme1 is 2TB Samsung 970 EVO
nvme2 is 2TB Samsung 980 PRO
In order to eliminate possible hardware problems, I moved
everything from nvme1 (970 EVO) to nvme2 (980 PRO). The
problem still occurs, with the same symptoms as above.
>How-To-Repeat:
See above
>Fix:
Don't know, but perhaps this should be a blocker for the -10
release?