
Re: ZFS RAIDZ2 and wd uncorrectable data error - why does ZFS not notice the hardware error?



Hello everyone,

The story is slowly coming to a conclusion, and I would like to describe my observations for the sake of completeness.

According to [1], SATA/ATA on NetBSD does not support hot swapping, so I shut down the NAS and swapped the disk with the machine powered off.

I installed the replacement disk just as it came out of the box, i.e. without any special preparation.

Because the new disk did not have any partitions at boot time, the wedge of the last (non-defective) hard disk, previously "dk3", slipped forward and was assigned "dk2". After I created the GPT partition on wd2, its wedge was recognised as "dk3" (a sketch of the partitioning commands follows after the output below). The result was this:

```
# zpool status
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            dk0                   ONLINE       0     0     0
            dk1                   ONLINE       0     0     0
            12938104637987333436  OFFLINE      0     0     0  was /dev/dk2
            11417607113939770484  UNAVAIL      0     0     0  was /dev/dk3

errors: No known data errors
```
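
For reference, creating such a GPT partition for ZFS on the new disk looks roughly like the following. This is only a sketch from memory, not the exact commands: the alignment and the "zfs2" label are example values, and the partition type name may differ depending on the gpt(8) version.

```
# sketch, assuming the new disk is wd2 and the whole disk is used;
# the alignment and the "zfs2" label are example values
doas gpt create wd2
doas gpt add -a 2m -l zfs2 -t zfs wd2     # type may be "zfs" or "fbsd-zfs"
doas dkctl wd2 listwedges                 # shows the dkN wedge that was created
```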

After another reboot, the order was correct again:

```
saturn$ doas zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 140K in 0h0m with 0 errors on Sat Jul 17 08:14:34 2021
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            dk0                   ONLINE       0     0     0
            dk1                   ONLINE       0     0     0
            12938104637987333436  OFFLINE      0     0     0  was /dev/dk2
            dk3                   ONLINE       0     0     1

errors: No known data errors
```

However, a "1" now appears in the CKSUM column for dk3.

I then initiated the replacement of the offline vdev as follows:

```
saturn$ doas zpool replace tank /dev/dk2
```

With the result:

```
saturn$ doas zpool status
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jul 17 08:18:56 2021
        16.0G scanned out of 5.69T at 123M/s, 13h24m to go
        3.87G resilvered, 0.27% done
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            dk0                     ONLINE       0     0     0
            dk1                     ONLINE       0     0     0
            replacing-2             OFFLINE      0     0     0
              12938104637987333436  OFFLINE      0     0     0  was /dev/dk2/old
              dk2                   ONLINE       0     0     0  (resilvering)
            dk3                     ONLINE       0     0     1

errors: No known data errors
```
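
As an aside: because the replacement wedge reuses the name dk2, a single argument to "zpool replace" was enough. If the numbering had not settled down again, I assume the offline vdev could also have been addressed by its GUID from the status output, roughly like this (untested sketch):

```
# untested sketch: name the old vdev by its GUID and the new wedge explicitly
doas zpool replace tank 12938104637987333436 /dev/dk2
```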

So things are looking good for the time being. I'll keep an eye on whether the CKSUM error also gets resolved in the course of this, or whether another problem is waiting for me there ;-)
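
If the CKSUM count on dk3 is still there once the resilver has finished, my plan would be roughly the following (a sketch, assuming no new errors turn up):

```
# sketch for after the resilver has completed
doas zpool scrub tank         # re-read and verify all data in the pool
doas zpool status -v tank     # check whether further CKSUM errors appear
doas zpool clear tank dk3     # reset the counters if the device looks healthy
```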

I still have one small question. When initialising the RAID, I had set it up on GPT partitions so that I could use the full storage capacity of the disks (instead of the 2 TB limit with disklabel) and also leave some buffer space free in case a replacement drive has a few sectors fewer than the existing ones. Now it looks as if the dynamic assignment of the wedges at boot time unnecessarily endangers the RAID whenever a disk is changed. Hence the question: is there a better option than using the wedges?

I remember that, when creating the pool, I also tried the variant with the label, NAME=zfs2 (as it works with newfs, for example), but that did not work. OK, as a workaround I could have prepared the disk on another system beforehand; now I know that for next time.
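
In the meantime, the wedge-to-disk mapping can at least be checked before touching the pool, roughly like this (a sketch; the disk names wd0 to wd3 are assumed from my setup):

```
# sketch: list the wedges of each disk and show the GPT labels,
# so a renumbered dkN can be spotted before running any zpool commands
for d in wd0 wd1 wd2 wd3; do
        echo "=== $d ==="
        doas dkctl $d listwedges
        doas gpt show -l $d
done
```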


Kind regards
Matthias



[1] https://mail-index.netbsd.org/netbsd-users/2011/01/28/msg007735.html



