NetBSD-Users archive


Re: ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?



On Wed, 14 Jul 2021 at 12:07, Matthias Petermann <mp%petermann-it.de@localhost> wrote:
>
> Hello all,
>
>
> ```
> [ 87240.313853] wd2: (uncorrectable data error)
> [ 87240.313853] wd2d: error reading fsbn 5707914328 of
> 5707914328-5707914455 (wd2 bn 5707914328; cn 5662613 tn 6 sn 46)
> [ 87465.637977] wd2d: error reading fsbn 5710464152 of
> 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer
> 338, retry 0
> [ 87465.637977] wd2: (uncorrectable data error)
> [ 87475.561683] wd2: soft error (corrected) xfer 338
> [ 87506.393194] wd2d: error reading fsbn 5710555128 of
> 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer
> 40, retry 0
> [ 87506.393194] wd2: (uncorrectable data error)
> [ 87515.156465] wd2d: error reading fsbn 5710555128 of
> 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer
> 40, retry 1
> ```
>
> The whole syslog is full of these messages. What surprises me is that
> there are "uncorrectable" data errors in the syslog. Nevertheless, the
> data can still be read - albeit very slowly. My assumption was that the
> redundancies of RAIDZ2 are being used to compensate for the defects. To
> my surprise, ZFS does not seem to have noticed any of these defects:
>
The wd driver is retrying (IIRC it retries 3 times) and succeeding on
the second or third attempt. (See xfer 338, retry 0, followed by a
'soft error (corrected)' with the same xfer number 10 seconds later.
That is the retry succeeding.)

The retry happens in the driver, below ZFS, so ZFS never sees the
error. If the read failed all 3 times you'd probably get a data error
in ZFS.
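
To make the pairing above easier to spot in a long syslog, here is a
rough Python sketch that matches each 'error reading' line to a later
'soft error (corrected)' line carrying the same xfer token. The
regexes are assumptions based only on the lines quoted above (with the
mail-wrapped lines rejoined), and `pair_retries` and `SAMPLE_LOG` are
names made up for this example, not anything the wd driver or ZFS
provides.

```python
import re

# Kernel messages quoted earlier in this thread, with wrapped lines rejoined.
SAMPLE_LOG = """\
[ 87465.637977] wd2d: error reading fsbn 5710464152 of 5710464152-5710464215 (wd2 bn 5710464152; cn 5665143 tn 0 sn 8), xfer 338, retry 0
[ 87465.637977] wd2: (uncorrectable data error)
[ 87475.561683] wd2: soft error (corrected) xfer 338
[ 87506.393194] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 0
[ 87506.393194] wd2: (uncorrectable data error)
[ 87515.156465] wd2d: error reading fsbn 5710555128 of 5710555128-5710555255 (wd2 bn 5710555128; cn 5665233 tn 4 sn 12), xfer 40, retry 1
"""

# Assumed message layouts; the xfer token is treated as an opaque word.
ERROR_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<disk>wd\d+)[a-z]?: "
    r"error reading fsbn (?P<fsbn>\d+).*?xfer (?P<xfer>\w+), retry (?P<retry>\d+)"
)
SOFT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<disk>wd\d+): "
    r"soft error \(corrected\) xfer (?P<xfer>\w+)"
)


def pair_retries(log_text):
    """Return (recovered, unresolved) lists of transfer records."""
    pending = {}    # (disk, xfer) -> transfer that has failed and not yet recovered
    recovered = []  # transfers that later logged 'soft error (corrected)'
    for line in log_text.splitlines():
        if m := ERROR_RE.search(line):
            key = (m["disk"], m["xfer"])
            rec = pending.setdefault(key, {
                "disk": m["disk"],
                "xfer": m["xfer"],
                "fsbn": int(m["fsbn"]),
                "first_error": float(m["time"]),
                "retries": 0,
            })
            # Remember the highest retry counter seen for this transfer.
            rec["retries"] = max(rec["retries"], int(m["retry"]))
        elif m := SOFT_RE.search(line):
            rec = pending.pop((m["disk"], m["xfer"]), None)
            if rec is not None:
                # Seconds between the first failure and the successful retry.
                rec["recovered_after"] = round(float(m["time"]) - rec["first_error"], 3)
                recovered.append(rec)
    return recovered, list(pending.values())


if __name__ == "__main__":
    ok, still_failing = pair_retries(SAMPLE_LOG)
    for rec in ok:
        print("recovered:", rec)
    for rec in still_failing:
        print("no success logged yet:", rec)
```

Transfers that end up in the second list are the ones worth watching:
if the driver gives up on one of those, that is the point where ZFS
would be handed an I/O error.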

>
> For the sake of completeness, here is the S.M.A.R.T. output - even if
> I find it difficult to interpret:
>
> ```
> saturn$ doas atactl wd2 smart status
> SMART supported, SMART enabled
> id value thresh crit collect reliability description                 raw
>    1 197   51     yes online  positive    Raw read error rate         38669
>    3 176   21     yes online  positive    Spin-up time                6158
>    4 100    0     no  online  positive    Start/stop count            510
>    5 200  140     yes online  positive    Reallocated sector count    0

I would expect this value to be greater than 0 if the drive were
failing. Is the drive bad, or is it the cabling?
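
For what it's worth, the same check can be scripted. This is only a
sketch: it parses the table rows exactly as quoted above, reports the
raw reallocated-sector count (id 5), and flags any attribute whose
normalized value has dropped to or below its threshold, which is the
usual SMART convention. The row regex is an assumption that fits these
four rows and may not cover every line atactl can print;
`parse_smart`, `failing_attributes` and `SAMPLE_SMART` are
hypothetical names.

```python
import re

# The attribute rows quoted above from `atactl wd2 smart status`.
SAMPLE_SMART = """\
id value thresh crit collect reliability description                 raw
   1 197   51     yes online  positive    Raw read error rate         38669
   3 176   21     yes online  positive    Spin-up time                6158
   4 100    0     no  online  positive    Start/stop count            510
   5 200  140     yes online  positive    Reallocated sector count    0
"""

# Assumed row layout: six whitespace-separated columns, a free-text
# description, then an integer raw value at the end of the line.
ROW_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<value>\d+)\s+(?P<thresh>\d+)\s+(?P<crit>yes|no)\s+"
    r"(?P<collect>\S+)\s+(?P<reliability>\S+)\s+(?P<description>.+?)\s+(?P<raw>\d+)\s*$"
)


def parse_smart(text):
    """Return {attribute id: row dict} for every line that looks like a row."""
    rows = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows[int(m["id"])] = {
                "id": int(m["id"]),
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "critical": m["crit"] == "yes",
                "description": m["description"],
                "raw": int(m["raw"]),
            }
    return rows


def failing_attributes(rows):
    """Attributes whose normalized value is at or below a non-zero threshold."""
    return [r for r in rows.values() if r["thresh"] > 0 and r["value"] <= r["thresh"]]


if __name__ == "__main__":
    rows = parse_smart(SAMPLE_SMART)
    print("reallocated sectors (raw):", rows[5]["raw"])
    print("attributes at or below threshold:", failing_attributes(rows))
```

Raw values are vendor specific, so a large raw read error rate on its
own does not say much; the reallocated count and the value/thresh
comparison are the more telling columns.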

Cheers,

Ian

