NetBSD-Users archive


Re: ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?



On Wed, 14 Jul 2021, Greg Troxel wrote:

What I do for each of my (physical) disks, spinning and SSD, once every
few months is (x86-centric; use partition c on other ports)

 dd if=/dev/rwd0d of=/dev/null bs=1m

and see if that throws any errors.  If there is one, I try to read that
block a few times, and generally then will 1) take that as a sign to
replace the disk (or move it to an nth external backup) and 2) write
that sector, so that it gets reallocated.  If the disk is part of raid1,
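
For illustration, the read-everything-and-watch-for-errors pass described
above could be sketched in Python roughly as follows. This is only a
sketch, not a replacement for dd: the function name scan_for_read_errors
and the max_errors cut-off are made up for this example, and it assumes
that a failed read surfaces as an OSError and that the device accepts
1 MiB reads at aligned offsets.

```python
import os


def scan_for_read_errors(path, block_size=1024 * 1024, max_errors=100):
    """Read `path` sequentially and return (bytes_read, bad_offsets).

    bad_offsets lists the starting byte offset of every block whose
    read raised OSError. The scan stops at end of file/device or once
    max_errors failures have been recorded (hypothetical safety limit,
    so that a vanished device does not make the loop run forever).
    """
    bad_offsets = []
    bytes_read = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # The kernel reported an error for this block: note
                # where it happened and move on to the next block.
                bad_offsets.append(offset)
                if len(bad_offsets) >= max_errors:
                    break
                offset += block_size
                continue
            if not chunk:
                break  # end of file or device
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

On NetBSD/x86 one would point it at the raw whole-disk device as root,
e.g. scan_for_read_errors("/dev/rwd0d"), and then look at the returned
offsets. Any offset in the list marks a 1 MiB block worth re-reading a
few times, as described above.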


You can make the drive itself do that whole disk scan and collect
the `offline' statistics while it is doing so. This is using the
smartmontools package:

root# smartctl -t long /dev/XXX

The command returns straight away and prints an estimate of how long
the test will take (a few hours for TB-capacity drives); the drive
runs the test in the background. After the test completes (or to
check on its progress) run:

root# smartctl --all /dev/XXX > /tmp/XXX.smart-log.txt

saturn$ doas atactl wd2 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
   1 197   51     yes online  positive    Raw read error rate         38669
   3 176   21     yes online  positive    Spin-up time                6158
   4 100    0     no  online  positive    Start/stop count            510
   5 200  140     yes online  positive    Reallocated sector count    0
   7 200    0     no  online  positive    Seek error rate             0
   9  64    0     no  online  positive    Power-on hours count        26740
  10 100    0     no  online  positive    Spin retry count            0
  11 100    0     no  online  positive    Calibration retry count     0
  12 100    0     no  online  positive    Device power cycle count    506
 192 200    0     no  online  positive    Power-off retract count     99
 193 200    0     no  online  positive    Load cycle count            2672
 194 117    0     no  online  positive    Temperature                 33

 196 200    0     no  online  positive    Reallocated event count     0
 197 200    0     no  online  positive    Current pending sector      18

This is the big deal.  The drive has decided that 18 sectors are not
ok.  It will reallocate them when they are next written, but until
then it reports them as uncorrectable on read rather than silently
handing bad data to the OS.

 198 100    0     no  offline positive    Offline uncorrectable       0

 199 200    0     no  online  positive    Ultra DMA CRC error count   0
 200 100    0     no  offline positive    Write error rate            0


mp@: What's surprising is how pristine the drive looks, apart from
that `Current pending sector' count (and even there the normalized
value, like all the others, is still above its threshold). Is it a
new drive? If it is, then sectors already waiting to be reallocated
are a worry. Are the cables also OK? Check them, too.

As a comparison, here's what my 15-year-old drive looks like:

$ sudo atactl wd0 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1 119    6     yes online  positive    Raw read error rate         227910048
  3  99    0     yes online  positive    Spin-up time                0
  4  93   20     no  online  positive    Start/stop count            7741
  5 100   36     yes online  positive    Reallocated sector count    0
  7  82   30     yes online  positive    Seek error rate             4464083330
  9  74    0     no  online  positive    Power-on hours count        83567178701327
 10 100   97     yes online  positive    Spin retry count            0
 12  93   20     no  online  positive    Device power cycle count    7724
184 100   99     no  online  positive    End-to-end error            0
187 100    0     no  online  positive    Reported Uncorrectable Errors 0
188 100    0     no  online  positive    Command Timeout             0
189 100    0     no  online  positive    High Fly Writes             0
190  67   45     no  online  positive    Airflow Temperature         33 Lifetime min/max 23/0
191 100    0     no  online  positive    G-sense error rate          179
192 100    0     no  online  positive    Power-off retract count     730
193   1    0     no  online  positive    Load cycle count            1041873
194  33    0     no  online  positive    Temperature                 33 Lifetime min/max 0/19
196  77   30     yes online  positive    Reallocated event count     172189533884560
197 100    0     no  online  positive    Current pending sector      0
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
240  77    0     no  offline positive    Head flying hours           172189533884560
241 100    0     no  offline positive    Total LBAs Written          1786969693
242 100    0     no  offline positive    Total LBAs Read             3326934803
254 100    0     no  online  positive    Free Fall Sensor            0

None of the current value fields have dropped below their thresholds.
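
If you want to automate that eyeballing, here is a rough Python sketch
that reads a table in the format shown above and flags rows worth a
second look. The names parse_smart_status and flag_attributes are made
up for this example. It assumes the raw value is the first all-digit
token after the description, and it treats a normalized value at or
below a non-zero threshold, or a non-zero raw count for attributes 5,
197 and 198, as worth reporting. Those choices are mine, not anything
atactl itself does.

```python
def parse_smart_status(text):
    """Parse attribute rows from `atactl <dev> smart status`-style text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows begin with three numbers: id, value, thresh.
        if len(fields) < 7 or not all(f.isdigit() for f in fields[:3]):
            continue
        rest = fields[6:]  # description words followed by the raw value
        # Assumption: the raw value is the first all-digit token here;
        # trailing text such as "Lifetime min/max 23/0" is ignored.
        raw_index = next((i for i, f in enumerate(rest) if f.isdigit()), None)
        if raw_index is None:
            continue
        rows.append({
            "id": int(fields[0]),
            "value": int(fields[1]),
            "thresh": int(fields[2]),
            "critical": fields[3] == "yes",
            "description": " ".join(rest[:raw_index]),
            "raw": int(rest[raw_index]),
        })
    return rows


# Reallocated sectors, current pending sectors, offline uncorrectable.
SECTOR_COUNT_IDS = (5, 197, 198)


def flag_attributes(rows):
    """Return one message per condition that deserves attention."""
    messages = []
    for row in rows:
        label = "%d %s" % (row["id"], row["description"])
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            messages.append("%s: value %d at or below threshold %d"
                            % (label, row["value"], row["thresh"]))
        if row["id"] in SECTOR_COUNT_IDS and row["raw"] > 0:
            messages.append("%s: raw count %d" % (label, row["raw"]))
    return messages
```

Fed the first table in this message, flag_attributes should report only
the `Current pending sector' row, which matches the reading above. The
raw columns of some vendors pack several numbers into one field (see
the odd-looking values in my drive's output), so treat the parsed raw
numbers for other attributes with suspicion.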

FYI mp@: https://www.smartmontools.org/wiki/FAQ

-RVP

