NetBSD-Bugs archive
kern/55189: NVMe SSD reports unsafe shutdowns
>Number: 55189
>Category: kern
>Synopsis: NVMe SSD reports unsafe shutdowns
>Confidential: no
>Severity: non-critical
>Priority: medium
>Responsible: kern-bug-people
>State: open
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Mon Apr 20 15:25:01 +0000 2020
>Originator: Andreas Gustafsson
>Release: NetBSD 9.0
>Organization:
>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:
I'm running NetBSD 9.0/amd64 on a server with an NVMe SSD as a
non-root disk, holding an FFS that is mounted from fstab and actively
used. The dmesg identifies it as:
[ 1.114589] nvme0 at pci9 dev 0 function 0: vendor 144d product a804 (rev. 0x00)
[ 1.114589] nvme0: NVMe 1.2
[ 1.114589] nvme0: for admin queue interrupting at msix0 vec 0
[ 1.114589] nvme0: Samsung SSD 960 EVO 250GB, firmware 3B7QCXE7, serial S3ESNX1K203084E
[ 1.114589] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
[ 1.114589] nvme0: for io queue 2 interrupting at msix0 vec 2 affinity to cpu1
[ 1.114589] nvme0: for io queue 3 interrupting at msix0 vec 3 affinity to cpu2
[ 1.114589] nvme0: for io queue 4 interrupting at msix0 vec 4 affinity to cpu3
[ 1.114589] nvme0: for io queue 5 interrupting at msix0 vec 5 affinity to cpu4
[ 1.114589] nvme0: for io queue 6 interrupting at msix0 vec 6 affinity to cpu5
[ 1.114589] nvme0: for io queue 7 interrupting at msix0 vec 7 affinity to cpu6
[ 1.114589] ld0 at nvme0 nsid 1
[ 1.114589] ld0: 232 GB, 30401 cyl, 255 head, 63 sec, 512 bytes/sect x 488397168 sectors
I recently figured out the command line for getting SMART information
from the drive, namely
sudo smartctl -d nvme,0x1 -a /dev/nvme0
The output was mostly as expected, but one item stood out:
Unsafe Shutdowns: 68
This is only two fewer than the reported number of power cycles,
and appears to increase by one for each power cycle, even after
applying the patch from PR 54969. The machine is typically
powered off by running "halt -p".
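To check whether the counter really moves in lockstep with power cycles, the two values can be pulled out of saved smartctl output and compared across a reboot. The following is only a minimal sketch: the helper names (parse_counters, counter_deltas) are hypothetical, and it assumes the counter lines look exactly like the ones in the smartctl 7.0 output quoted in this report.

```python
import re

# Matches the two counter lines as they appear in the SMART/Health section
# of the `smartctl -a` output quoted in this report (assumed format).
_COUNTER_RE = re.compile(
    r"^\s*(Power Cycles|Unsafe Shutdowns):\s*([\d,]+)\s*$", re.MULTILINE
)


def parse_counters(smartctl_text):
    """Return a dict mapping each counter name found to its integer value."""
    return {
        name: int(value.replace(",", ""))
        for name, value in _COUNTER_RE.findall(smartctl_text)
    }


def counter_deltas(before, after):
    """Return (power cycle delta, unsafe shutdown delta) between two snapshots."""
    return (
        after["Power Cycles"] - before["Power Cycles"],
        after["Unsafe Shutdowns"] - before["Unsafe Shutdowns"],
    )


# Values copied from the smartctl output in this report.
SAMPLE = """\
Power Cycles: 70
Power On Hours: 11,767
Unsafe Shutdowns: 68
"""

counters = parse_counters(SAMPLE)
print(counters)
```

If a snapshot is saved before "halt -p" and another after the next boot, a delta of (1, 1) would mean the drive counted that power-off as unsafe, while (1, 0) would indicate a clean shutdown.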
There are also some items in the SMART error log, but I believe those
are harmless and resulted from me running smartctl with the wrong
options while trying to figure out the right ones.
I see no detach messages for ld0 or nvme0 on the console when halting
the machine, even though such messages are printed for other disks
(including sd0 since applying the patch from PR 54969). Are such
messages expected?
So far, I have not experienced any data loss nor noticed any
unexpected fsck activity on boot.
The full smartctl output follows.
smartctl 7.0 2018-12-30 r4883 [NetBSD 9.0 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 960 EVO 250GB
Serial Number: S3ESNX1K203084E
Firmware Version: 3B7QCXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 250,059,350,016 [250 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 250,059,350,016 [250 GB]
Namespace 1 Utilization: 223,633,428,480 [223 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5281b2e8cb
Local Time is: Mon Apr 20 17:02:41 2020 EEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 79 Celsius
Supported Power States
St Op      Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W        -        -    0  0  0  0        0       0
 1 +     5.09W        -        -    1  1  1  1        0       0
 2 +     4.08W        -        -    2  2  2  2        0       0
 3 -   0.0400W        -        -    3  3  3  3      210    1500
 4 -   0.0050W        -        -    4  4  4  4     2200    6000
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 5%
Data Units Read: 14,061,093 [7.19 TB]
Data Units Written: 57,169,239 [29.2 TB]
Host Read Commands: 443,435,057
Host Write Commands: 1,627,039,827
Controller Busy Time: 1,256
Power Cycles: 70
Power On Hours: 11,767
Unsafe Shutdowns: 68
Media and Data Integrity Errors: 0
Error Information Log Entries: 12
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 41 Celsius
Temperature Sensor 2: 54 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId  CmdId Status PELoc LBA  NSID VS
  0       12    0 0x0000 0x4016     -   0 65535  -
  1       11    0 0x0000 0x4016 0x004   0 65535  -
  2       10    0 0x0000 0x4016     -   0 65535  -
  3        9    0 0x0000 0x4016 0x004   0 65535  -
  4        8    0 0x0000 0x4016 0x004   0 65535  -
  5        7    0 0x0000 0x4016 0x004   0 65535  -
  6        6    0 0x0000 0x4016 0x004   0 65535  -
  7        5    0 0x0000 0x4016 0x004   0 65535  -
  8        4    0 0x0000 0x4016 0x004   0 65535  -
  9        3    0 0x0000 0x4016 0x004   0 65535  -
 10        2    0 0x0000 0x4016 0x004   0 65535  -
 11        1    0 0x0000 0x4212 0x028   0 65535  -
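The raw Status values in the error log above can be split into their bit fields to see what the drive was rejecting. This is a rough sketch, not taken from smartctl itself. It assumes the layout I understand the NVMe specification to give for the error log entry's Status Field: bit 0 is the phase tag, bits 8:1 the status code (SC), bits 11:9 the status code type (SCT), bit 14 "More" and bit 15 "Do Not Retry". The two names in KNOWN_STATUS are likewise my reading of the specification's status code tables and should be checked against it.

```python
def decode_status(status_field):
    """Split a 16-bit error log Status value into its bit fields.

    Assumed layout: bit 0 phase tag, bits 8:1 SC, bits 11:9 SCT,
    bit 14 More, bit 15 Do Not Retry.
    """
    return {
        "phase": status_field & 0x1,
        "sc": (status_field >> 1) & 0xFF,
        "sct": (status_field >> 9) & 0x7,
        "more": bool((status_field >> 14) & 0x1),
        "dnr": bool((status_field >> 15) & 0x1),
    }


# (SCT, SC) -> name, for the two combinations seen in this report only.
# These names are an assumption based on my reading of the NVMe spec.
KNOWN_STATUS = {
    (0x0, 0x0B): "Invalid Namespace or Format",
    (0x1, 0x09): "Invalid Log Page",
}

for raw in (0x4016, 0x4212):
    fields = decode_status(raw)
    name = KNOWN_STATUS.get((fields["sct"], fields["sc"]), "unknown")
    print(f"0x{raw:04x}: SCT={fields['sct']:#x} SC={fields['sc']:#04x} "
          f"more={fields['more']} dnr={fields['dnr']} ({name})")
```

Under those assumptions, both values decode to errors about an invalid namespace or log page request. That would be consistent with the entries coming from smartctl being run with the wrong options rather than from media problems.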
>How-To-Repeat:
>Fix: