kern/55189: NVMe SSD reports unsafe shutdowns

To: kern-bug-people%netbsd.org@localhost,gnats-admin%netbsd.org@localhost,netbsd-bugs%netbsd.org@localhost
Subject: kern/55189: NVMe SSD reports unsafe shutdowns
From: gson%gson.org@localhost (Andreas Gustafsson)
Date: Mon, 20 Apr 2020 15:25:01 +0000 (UTC)

>Number:         55189
>Category:       kern
>Synopsis:       NVMe SSD reports unsafe shutdowns
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Mon Apr 20 15:25:01 +0000 2020
>Originator:     Andreas Gustafsson
>Release:        NetBSD 9.0
>Organization:

>Environment:
System: NetBSD
Architecture: x86_64
Machine: amd64
>Description:

I'm running NetBSD 9.0/amd64 on a server with an NVMe SSD as a
non-root disk, holding an FFS that is mounted from fstab and actively
used.  The dmesg identifies it as:

   [     1.114589] nvme0 at pci9 dev 0 function 0: vendor 144d product a804 (rev. 0x00)
   [     1.114589] nvme0: NVMe 1.2
   [     1.114589] nvme0: for admin queue interrupting at msix0 vec 0
   [     1.114589] nvme0: Samsung SSD 960 EVO 250GB, firmware 3B7QCXE7, serial S3ESNX1K203084E
   [     1.114589] nvme0: for io queue 1 interrupting at msix0 vec 1 affinity to cpu0
   [     1.114589] nvme0: for io queue 2 interrupting at msix0 vec 2 affinity to cpu1
   [     1.114589] nvme0: for io queue 3 interrupting at msix0 vec 3 affinity to cpu2
   [     1.114589] nvme0: for io queue 4 interrupting at msix0 vec 4 affinity to cpu3
   [     1.114589] nvme0: for io queue 5 interrupting at msix0 vec 5 affinity to cpu4
   [     1.114589] nvme0: for io queue 6 interrupting at msix0 vec 6 affinity to cpu5
   [     1.114589] nvme0: for io queue 7 interrupting at msix0 vec 7 affinity to cpu6
   [     1.114589] ld0 at nvme0 nsid 1
   [     1.114589] ld0: 232 GB, 30401 cyl, 255 head, 63 sec, 512 bytes/sect x 488397168 sectors

I recently figured out the command line for getting SMART information
from the drive, namely

  sudo smartctl -d nvme,0x1 -a /dev/nvme0

The output was mostly as expected, but one item stood out:

  Unsafe Shutdowns:                   68

This is only two fewer than the reported number of power cycles,
and the count appears to increase by one for each power cycle, even
after applying the patch from PR 54969.  The machine is typically
powered off by running "halt -p".
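
To make the comparison between the two counters repeatable across
power cycles, the relevant lines of the smartctl output can be parsed
with a few lines of Python.  This is only a sketch: the function name
parse_smart_counters is made up for illustration, and it assumes the
"Name: value" layout shown in the smartctl 7.0 output in this report.

```python
import re

def parse_smart_counters(text):
    """Collect 'Name: integer' lines from smartctl -a output into a dict."""
    counters = {}
    for line in text.splitlines():
        # Accept only lines whose value is a plain integer, optionally
        # with thousands separators (e.g. "11,767").
        m = re.match(r"^([A-Za-z][^:]*):\s+([\d,]+)\s*$", line)
        if m:
            counters[m.group(1).strip()] = int(m.group(2).replace(",", ""))
    return counters

# Sample lines copied from the SMART data in this report.
sample = """\
Power Cycles:                       70
Power On Hours:                     11,767
Unsafe Shutdowns:                   68
"""

counters = parse_smart_counters(sample)
clean = counters["Power Cycles"] - counters["Unsafe Shutdowns"]
print(f"clean shutdowns: {clean}")   # prints "clean shutdowns: 2"
```

Running this before and after a "halt -p" cycle on the real smartctl
output would show whether both counters advance together.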

There are also some items in the SMART error log, but I believe those
are harmless and resulted from me running smartctl with the wrong
options while trying to figure out the right ones.

I see no detach messages for either ld0 or nvme0 on the console when
halting the machine, even though such messages are printed for other
disks (including sd0 since applying the patch from PR 54969).  Are
such messages expected?

So far, I have not experienced any data loss nor noticed any
unexpected fsck activity on boot.

The full smartctl output follows.

smartctl 7.0 2018-12-30 r4883 [NetBSD 9.0 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 250GB
Serial Number:                      S3ESNX1K203084E
Firmware Version:                   3B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 250,059,350,016 [250 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          250,059,350,016 [250 GB]
Namespace 1 Utilization:            223,633,428,480 [223 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5281b2e8cb
Local Time is:                      Mon Apr 20 17:02:41 2020 EEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        41 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    5%
Data Units Read:                    14,061,093 [7.19 TB]
Data Units Written:                 57,169,239 [29.2 TB]
Host Read Commands:                 443,435,057
Host Write Commands:                1,627,039,827
Controller Busy Time:               1,256
Power Cycles:                       70
Power On Hours:                     11,767
Unsafe Shutdowns:                   68
Media and Data Integrity Errors:    0
Error Information Log Entries:      12
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               41 Celsius
Temperature Sensor 2:               54 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         12     0  0x0000  0x4016      -            0 65535     -
  1         11     0  0x0000  0x4016  0x004            0 65535     -
  2         10     0  0x0000  0x4016      -            0 65535     -
  3          9     0  0x0000  0x4016  0x004            0 65535     -
  4          8     0  0x0000  0x4016  0x004            0 65535     -
  5          7     0  0x0000  0x4016  0x004            0 65535     -
  6          6     0  0x0000  0x4016  0x004            0 65535     -
  7          5     0  0x0000  0x4016  0x004            0 65535     -
  8          4     0  0x0000  0x4016  0x004            0 65535     -
  9          3     0  0x0000  0x4016  0x004            0 65535     -
 10          2     0  0x0000  0x4016  0x004            0 65535     -
 11          1     0  0x0000  0x4212  0x028            0 65535     -
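
The Status column above can be split into its component fields.  The
sketch below assumes that smartctl prints the raw 16-bit Status Field
of the NVMe error log entry, with bit 0 being the phase tag and the
status proper in bits 15:1 (status code, status code type, More, Do
Not Retry).  That layout is an assumption taken from the NVMe
specification, not something confirmed by this report, and the
function name decode_nvme_status is hypothetical.

```python
def decode_nvme_status(raw):
    """Split a 16-bit NVMe error-log status value into its fields.

    Assumed layout: bit 0 phase tag, bits 8:1 status code (SC),
    bits 11:9 status code type (SCT), bit 14 More, bit 15 Do Not Retry.
    """
    return {
        "phase": raw & 0x1,
        "sc": (raw >> 1) & 0xFF,
        "sct": (raw >> 9) & 0x7,
        "more": (raw >> 14) & 0x1,
        "dnr": (raw >> 15) & 0x1,
    }

# The two distinct Status values from the error log in this report.
for value in (0x4016, 0x4212):
    f = decode_nvme_status(value)
    print(f"0x{value:04x}: SCT={f['sct']} SC=0x{f['sc']:02x} "
          f"More={f['more']} DNR={f['dnr']}")
```

If that layout holds, 0x4016 decodes to SCT 0 / SC 0x0b and 0x4212 to
SCT 1 / SC 0x09, which the NVMe specification appears to list as
"Invalid Namespace or Format" and "Invalid Log Page" respectively.
That would be consistent with the entries coming from smartctl being
run with the wrong options.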

>How-To-Repeat:

>Fix:

Follow-Ups:
- re: kern/55189: NVMe SSD reports unsafe shutdowns
  - From: matthew green
- Re: kern/55189: NVMe SSD reports unsafe shutdowns
  - From: Paul Goyette
