NetBSD-Users archive


Re: zfs resilver in(de)finite loop?



Pouya Tafti <pouya+lists.netbsd%nohup.io@localhost> writes:

[snip]

> # zpool replace pond wedges/slot4zfs wedges/slot7zfs
>
> many hours ago.  Since then, as I periodically check
> zpool(8) status it appears that the various counters and
> timers keep starting over, while the error rates keep
> increasing.  Most recently:
>
> # zpool status
>
>   pool: pond
>  state: ONLINE
> status: One or more devices is currently being resilvered.  The pool will
> 	continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
>   scan: resilver in progress since Sat Aug 14 21:02:49 2021
>         118G scanned out of 1.59T at 230M/s, 1h52m to go
>         19.6G resilvered, 7.23% done
> config:
>
> 	NAME                   STATE     READ WRITE CKSUM
> 	pond                   ONLINE       0     0     0
> 	  raidz2-0             ONLINE       0     0     0
> 	    wedges/slot0zfs    ONLINE       0     0     0
> 	    wedges/slot1zfs    ONLINE       0     0     0
> 	    wedges/slot2zfs    ONLINE       0     0     0
> 	    wedges/slot3zfs    ONLINE       0     0     0
> 	    replacing-4        ONLINE       0     0   945
> 	      wedges/slot4zfs  ONLINE     299 5.07K     0  (resilvering)
> 	      wedges/slot7zfs  ONLINE       0     0     0  (resilvering)
> 	    wedges/slot5zfs    ONLINE       0     0     0
>
> errors: No known data errors
>

[snip]

So...  it looks like ZFS may have tried to resilver the failing drive
when you performed the replacement, or had already started to resilver
it at that point.  I have seen this kind of restarting resilver on
another OS with ZFS.  In my case it ultimately finished; I just had to
wait a while.

As a general rule, though not a hard requirement, it is a good idea to
take a ZFS member that has not fully failed offline before replacing
it.  If the member has failed completely, that is different, but if
there is any chance that it still partly functions, it is better to
offline it first and then replace it.

In this case, although I will admit I am not completely sure, I think
you can still offline the failing drive and the resilvering of the
replacement might proceed as you expect.
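
A minimal sketch of that sequence, using the pool and wedge names from
the quoted status output.  The run helper and the DRY_RUN variable are
hypothetical conveniences, not part of zpool(8); by default the script
only prints what it would do.  Whether zpool will accept the offline
while the replace is still in progress is the same uncertainty I
mentioned above.

```shell
#!/bin/sh
# Sketch only: offline the failing member so that the replacement can
# resilver from the remaining raidz2 members.
# Pool and device names are taken from the quoted zpool status output.
POOL=pond
FAILING=wedges/slot4zfs

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

offline_failing() {
    # Take the old member offline, then look at the pool again.
    run zpool offline "$POOL" "$FAILING"
    run zpool status "$POOL"
}

offline_failing
```

Run it once as-is to check the commands, then with DRY_RUN=0 in the
environment to actually issue them.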

I think what you may be seeing is that ZFS is trying to rebuild the
failing drive from the rest of the raidz2 members while also trying to
replace it, and the churn of doing both may be triggering the restarts.
Alternatively, you may be seeing the interleaved output of one resilver
and then another; I believe more than one is permitted to run at a
time.
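
One way to tell a real restart from ordinary progress would be to watch
the "since" timestamp on the scan line.  The sketch below assumes that
timestamp is reset each time a resilver starts over; the function names
and the polling interval are made up for illustration.

```shell
#!/bin/sh
# Sketch only: tell a restarting resilver apart from one that progresses.

# Print the "since" timestamp from zpool status text read on stdin.
# Prints nothing if no resilver is in progress.
resilver_start() {
    sed -n 's/^[[:space:]]*scan: resilver in progress since //p'
}

# Hypothetical polling loop: report whenever the start time changes.
# $1 = pool name, $2 = seconds between polls (default 600).
watch_resilver() {
    prev=""
    while :; do
        cur=$(zpool status "$1" | resilver_start)
        if [ -z "$cur" ]; then
            echo "no resilver in progress"
            return 0
        fi
        if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
            echo "resilver restarted: $prev -> $cur"
        fi
        prev=$cur
        sleep "${2:-600}"
    done
}
```

Calling "watch_resilver pond" would then print a line only when the
start time moves, which is easier to read than comparing counters by
eye.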





-- 
Brad Spencer - brad%anduin.eldar.org@localhost - KC8VKS - http://anduin.eldar.org

