NetBSD-Users archive


ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?



Hello all,

I run a NetBSD-based NAS at home. It is currently running on NetBSD 9.1. The system is booted from a USB stick on which the root file system is also located. The storage is on 4 x 4 TB magnetic hard disks, configured as ZFS RAIDZ2.

Earlier I noticed that the I/O performance of the system had suddenly dropped drastically. A look at the syslog gives a pretty clear indication of the reason:

```
[ 87240.313853] wd2: (uncorrectable data error)
[ 87240.313853] wd2d: error reading fsbn 5707914328 of 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
[ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
[ 87465.637977] wd2: (uncorrectable data error)
[ 87475.561683] wd2: soft error (corrected) xfer 338
[ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
[ 87506.393194] wd2: (uncorrectable data error)
[ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1
```
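To see how many distinct disk regions are affected, and how often each one is retried, the messages can be summarized with a small script. This is only a minimal sketch in Python: the regular expression is an assumption derived from the message format quoted above, and the name `summarize_wd_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the kernel messages quoted above; other releases
# or drivers may format these lines differently.
WD_ERROR = re.compile(
    r"(?P<part>wd\d+[a-z]): error reading fsbn (?P<fsbn>\d+) "
    r"of (?P<first>\d+)-(?P<last>\d+)"
)


def summarize_wd_errors(lines):
    """Count read-error messages per (partition, first block of the range)."""
    counts = Counter()
    for line in lines:
        # finditer also copes with more than one message on a single line.
        for m in WD_ERROR.finditer(line):
            counts[(m.group("part"), int(m.group("first")))] += 1
    return counts


sample = [
    "[ 87465.637977] wd2d: error reading fsbn 5710464152 of "
    "5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0",
    "[ 87506.393194] wd2d: error reading fsbn 5710555128 of "
    "5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0",
    "[ 87515.156465] wd2d: error reading fsbn 5710555128 of "
    "5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1",
]

summary = summarize_wd_errors(sample)
for (part, first), n in sorted(summary.items()):
    print(part, first, n)
```

On the three sample lines this reports two distinct ranges on wd2d, one of which appears twice because of the retry.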

The whole syslog is full of these messages. What surprises me is that the syslog reports "uncorrectable" data errors, yet the data can still be read, albeit very slowly. My assumption was that the redundancy of RAIDZ2 is being used to compensate for the defects. To my surprise, ZFS does not seem to have noticed any of these defects:


```
saturn$ doas zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            dk0     ONLINE       0     0     0
            dk1     ONLINE       0     0     0
            dk2     ONLINE       0     0     0
            dk3     ONLINE       0     0     0

errors: No known data errors
```

Another indication that ZFS has not yet noticed the error: according to top, there is no significant CPU load during I/O, neither in user nor in system time. I would have expected at least some load if ZFS were reconstructing data from the redundancy.

So it looks like the hardware error can still largely be corrected at the level of the device driver, which makes me doubt the accuracy of the "uncorrectable data error" message.

Does anyone know what would have to happen for ZFS to notice the hardware defect?

Next, I will try to take the wd2 (dk2) component offline.

For the sake of completeness, here is the S.M.A.R.T. output, even though I find it difficult to interpret:

```
saturn$ doas atactl wd2 smart status
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1 197   51     yes online  positive    Raw read error rate         38669
  3 176   21     yes online  positive    Spin-up time                6158
  4 100    0     no  online  positive    Start/stop count            510
  5 200  140     yes online  positive    Reallocated sector count    0
  7 200    0     no  online  positive    Seek error rate             0
  9  64    0     no  online  positive    Power-on hours count        26740
 10 100    0     no  online  positive    Spin retry count            0
 11 100    0     no  online  positive    Calibration retry count     0
 12 100    0     no  online  positive    Device power cycle count    506
192 200    0     no  online  positive    Power-off retract count     99
193 200    0     no  online  positive    Load cycle count            2672
194 117    0     no  online  positive    Temperature                 33
196 200    0     no  online  positive    Reallocated event count     0
197 200    0     no  online  positive    Current pending sector      18
198 100    0     no  offline positive    Offline uncorrectable       0
199 200    0     no  online  positive    Ultra DMA CRC error count   0
200 100    0     no  offline positive    Write error rate            0
```
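To make the table easier to read, a short script can pick out the attributes that usually point at failing sectors. The following is a rough sketch in Python, not an authoritative interpretation: the parsing assumes the column layout shown above, the function name `parse_atactl_smart` is made up, and the choice of watched attributes (5, 196, 197, 198) is my own assumption about which counters matter here.

```python
def parse_atactl_smart(text):
    """Return {attribute id: (description, raw value)} for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric id and end with a numeric raw value;
        # the header and status lines do not, so they are skipped.
        if len(fields) < 8 or not fields[0].isdigit() or not fields[-1].isdigit():
            continue
        # Columns: id value thresh crit collect reliability description... raw
        description = " ".join(fields[6:-1])
        attrs[int(fields[0])] = (description, int(fields[-1]))
    return attrs


sample = """\
SMART supported, SMART enabled
id value thresh crit collect reliability description                 raw
  1 197   51     yes online  positive    Raw read error rate         38669
  5 200  140     yes online  positive    Reallocated sector count    0
196 200    0     no  online  positive    Reallocated event count     0
197 200    0     no  online  positive    Current pending sector      18
198 100    0     no  offline positive    Offline uncorrectable       0
"""

# Assumed set of sector-health attributes; a nonzero raw value is treated
# as suspicious in this sketch.
WATCHED = (5, 196, 197, 198)

attrs = parse_atactl_smart(sample)
suspicious = {i: attrs[i] for i in WATCHED if i in attrs and attrs[i][1] > 0}
print(suspicious)
```

For the values above, only attribute 197 (Current pending sector, raw value 18) is flagged, which would fit a disk that has sectors it cannot currently read.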

Kind regards
Matthias



