NetBSD-Bugs archive


kern/56329: nvme(4) takes long time to umount



>Number:         56329
>Category:       kern
>Synopsis:       nvme(4) takes long time to umount
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sun Jul 25 04:00:00 +0000 2021
>Originator:     Paul Goyette
>Release:        NetBSD 9.99.87
>Organization:
+--------------------+--------------------------+----------------------+
| Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:    |
| (Retired)          | FA29 0E3B 35AF E8AE 6651 | paul%whooppee.com@localhost    |
| Software Developer | 0786 F758 55DE 53BA 7731 | pgoyette%netbsd.org@localhost  |
|                    |                          | pgoyette99%gmail.com@localhost |
+--------------------+--------------------------+----------------------+
>Environment:
	
	
System: NetBSD speedy.whooppee.com 9.99.87 NetBSD 9.99.87 (SPEEDY 2021-07-23 13:58:03 UTC) #0: Sat Jul 24 00:01:27 UTC 2021 paul%speedy.whooppee.com@localhost:/build/netbsd-local/obj/amd64/sys/arch/amd64/compile/SPEEDY amd64
Architecture: x86_64
Machine: amd64
>Description:
	I am using an nvme(4) device as both the input AND the output
	device for standard system builds, so there is a lot of file
	creation, re-creation, and deletion.  Device details are:

	<lspci>
	03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961

	<dmesg>
	[     8.118210] nvme0 at pci3 dev 0 function 0: Samsung Electronics (3rd vendor ID) product a804 (rev. 0x00)

	Depending on how much activity has occurred, it can take up to
	15 seconds (or even longer) for the device/file system to
	unmount.  During this time, neither xosview nor systat shows
	any disk activity (presumably no commands are being issued),
	yet the device does not respond.  It "feels like" the device
	is dequeuing a large number of deferred operations, all of
	which are processed by the nvme controller without any host
	intervention.

	Here is a time-stamped console log of a system shutdown showing
	the lengthy delay for umount (in this case, ~15 seconds):

	...
	[ 900738.108348] unmounting 0xffff9afce05b4000 /kern (kernfs)...
	[ 900738.108348] unmounted kernfs on /kern type kernfs
	[ 900738.108348] unmounting 0xffff9afce0d30000 /build (/dev/ld0e)...
	[ 900753.244179] unmounted /dev/ld0e on /build type ffs
	[ 900753.244179] unmounting 0xffff9afce052f000 /home (/dev/wd0f)... 
	[ 900753.624326] unmounted /dev/wd0f on /home type ffs
	...
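	To put numbers on the delay, the script below pairs each
	"unmounting ..." console line with its matching "unmounted ..."
	line and subtracts the timestamps.  It is a hypothetical helper,
	not part of NetBSD; the function name unmount_durations and the
	regular expressions are assumptions based only on the message
	formats in the log above.  For /build (ld0e) the gap is
	900753.244179 - 900738.108348 = 15.135831 s, versus about 0.38 s
	for /home on wd0.

```python
import re

# Hypothetical helper, not part of NetBSD: pairs the "unmounting ..." and
# "unmounted ..." console messages and reports how long each unmount took,
# using the dmesg-style "[ seconds ]" timestamps.

# "[ 900738.108348] message" -> (timestamp, message)
LINE_RE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s+(.*)$")
# "unmounting <address> <mountpoint> (<source>)..." marks the start.
START_RE = re.compile(r"^unmounting \S+ (\S+) \(")
# "unmounted <source> on <mountpoint> type <fstype>" marks the end.
END_RE = re.compile(r"^unmounted \S+ on (\S+) type ")


def unmount_durations(log_text):
    """Return {mountpoint: seconds} for every unmount with both a start and an end line."""
    started = {}
    durations = {}
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue  # skip "..." and any other untimestamped lines
        stamp, msg = float(m.group(1)), m.group(2)
        start = START_RE.match(msg)
        if start:
            started[start.group(1)] = stamp
            continue
        end = END_RE.match(msg)
        if end and end.group(1) in started:
            mnt = end.group(1)
            durations[mnt] = stamp - started.pop(mnt)
    return durations


# Excerpt of the shutdown log from this report.
SAMPLE_LOG = """\
[ 900738.108348] unmounting 0xffff9afce05b4000 /kern (kernfs)...
[ 900738.108348] unmounted kernfs on /kern type kernfs
[ 900738.108348] unmounting 0xffff9afce0d30000 /build (/dev/ld0e)...
[ 900753.244179] unmounted /dev/ld0e on /build type ffs
[ 900753.244179] unmounting 0xffff9afce052f000 /home (/dev/wd0f)...
[ 900753.624326] unmounted /dev/wd0f on /home type ffs
"""

if __name__ == "__main__":
    for mnt, secs in unmount_durations(SAMPLE_LOG).items():
        print(f"{mnt}: {secs:.3f} s")
    # /kern: 0.000 s
    # /build: 15.136 s
    # /home: 0.380 s
```

	Pairing start and end lines by mount point, rather than by
	adjacency, keeps the sketch correct even if messages from
	different unmounts were ever interleaved in a longer capture.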

	Given that there seems to be some queue of deferred operations,
	it is not surprising that, after a system crash, a forced
	``fsck -fp'' identifies [tens of] thousands of unreferenced
	files.  So far, fsck has been successful at restoring the file
	system without any data loss.

	This misbehavior occurs whether or not the file system uses
	``-o log'' (wapbl).  It also occurs without ``-o discard''.

	I am not sure how to debug this further.  However, dh@ suggests
	that this behavior is definitely suboptimal and that it should
	be a blocker for the NetBSD-10 release.  (At the least, we
	ought to know enough to describe the actual bug, and perhaps
	provide a work-around.)
	
>How-To-Repeat:
	See above
	
>Fix:
	Yes, please
	

>Unformatted:
 	
 	

