NetBSD-Users archive


Re: diagnosis for disk drive errors (zfs on cgd on sata disk)



Duplicate, please ignore.  Apologies for the noise.

On Fri, 20 Aug 2021 at 06:34 +0200, Pouya Tafti wrote:
> After a recent drive failure in my primary zfs pool, I set
> up a secondary pool on a cgd(4) device on a single new sata
> hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
> hdd) to back up the primary.
> 
> I initially scrubbed the entire disk without apparent
> incident using a temporary cryptographic device and dd(1)
> as in the cgdconfig(8) man page.
> 
> Since then, the drive has failed in the same way and been
> detached twice in the past two days: once on the very first
> zfs(8) create operation, and again (after a reboot) after I
> had written hundreds of GiB to it through a zfs(8)
> send/receive pipe.  Here are the relevant system messages:
> 
> # dmesg
> ...
> [ 57131.573806] mpii0: physical device removed from slot 7
> [ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
> [ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
> [ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
> [ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
> [ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
> [ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
> [ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
> [ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
> [ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
> [ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
> [ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
> [ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
> [ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.643840] dk11 at sd7 (backupcgd0) deleted
> [ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
> [ 57131.643840] sd7: detached
> 
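One thing that can be read off the messages above: every cgd0 error seems to
pair with an sd7 error of the same kind at a fixed block distance, which would
suggest the cgd0 lines are the same failed transfers reported one layer up
rather than separate faults. Below is a minimal Python sketch of that check.
The log text is copied from the dmesg output above with the timestamps and the
parenthesised cn/tn/sn details dropped; the names LOG, PATTERN and
constant_offsets are made up for this illustration, and reading the offset as
the start of the wedge backing cgd0 is an assumption, not something the log
states.

```python
import re

# Error lines from the dmesg output above (timestamps and the
# parenthesised cn/tn/sn details dropped for brevity).
LOG = """\
sd7d: error writing fsbn 1816866262 of 1816866262-1816866389
cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205
sd7d: error reading fsbn 270904 of 270904-270919
sd7d: error reading fsbn 7814028344 of 7814028344-7814028359
sd7d: error reading fsbn 7814028856 of 7814028856-7814028871
cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687
cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175
cgd0d: error reading fsbn 8720 of 8720-8735
sd7d: error writing fsbn 1816866646 of 1816866646-1816866773
cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589
sd7d: error writing fsbn 1816866518 of 1816866518-1816866645
cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461
"""

# Captures the device, the operation, and the first block of the transfer.
PATTERN = re.compile(
    r"^(sd7|cgd0)d: error (reading|writing) fsbn (\d+) of \d+-\d+", re.M
)


def constant_offsets(log):
    """Return the set of offsets d such that every cgd0 error at block b
    has an sd7 error of the same kind (read/write) at block b + d."""
    blocks = {"sd7": set(), "cgd0": set()}
    for dev, op, bn in PATTERN.findall(log):
        blocks[dev].add((op, int(bn)))

    common = None
    for op, cgd_bn in blocks["cgd0"]:
        # All distances from this cgd0 block to sd7 blocks with the same op.
        candidates = {
            sd_bn - cgd_bn for sd_op, sd_bn in blocks["sd7"] if sd_op == op
        }
        common = candidates if common is None else common & candidates
    return common or set()


offsets = constant_offsets(LOG)
print(sorted(offsets))  # [262184]
```

A single surviving offset (262184 blocks) means each cgd0 error lines up with
exactly one sd7 error, so the log shows six failed transfers seen at two
layers, all at the moment the controller reported the device removed.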
> I don't know how to go about diagnosing the issue and would
> appreciate any suggestions.  In particular, the hdd is new
> and I wonder if I should return it for a replacement.  The
> previous disk in the same bay had also been showing
> read/write errors (although that drive never got detached).
> 
> Apart from the drive, I also have little faith in the
> backplane, cables, SAS controller (which I reflashed), RAM,
> etc., although here it looks to me like the problem could
> be somewhere between the drive and the controller.
> 
> Many thanks,
> Pouya
> 
> N.B. I'm also a bit confused by how zfs is handling this:
> zpool(8) appears to think the drive is still online, while
> zfs(8) doesn't list any datasets on it:
> 
> # zpool status -v puddle
>   pool: puddle
>  state: ONLINE
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool clear'.
>    see: http://illumos.org/msg/ZFS-8000-HC
>   scan: none requested
> config:
> 
> 	NAME              STATE     READ WRITE CKSUM
> 	puddle            ONLINE       0 3.62K     0
> 	  wedges/backup0  ONLINE       0   213     0
> 
> errors: Permanent errors have been detected in the following files:
> 
>         puddle/backup.pond/backup:<0x0>
>         puddle/backup.pond/backup:<0x10ecc5>
> 
> # zfs list puddle
> cannot open 'puddle': pool I/O is currently suspended
> 

