NetBSD-Users archive


Re: diagnosis for disk drive errors (zfs on cgd on sata disk)



On Fri, 20 Aug 2021 at 06:13 -0000, Michael van Elst wrote:
> [snip]
> > Yes. It could be the drive itself, but I'd suspect the
> > backplane or cables. The PSU is also a possible candidate.

On Fri, 20 Aug 2021 at 09:31 +0200, Pouya Tafti wrote:
> Thanks.  Retrying the replication in another bay now before
> opening up the box. 

The replication progressed for a few hours and then came to
a halt without any errors (IO rates just dropped to zero).
zpool(8) history and other access operations (e.g. ls)
entered an unresponsive D (uninterruptible wait) state
according to ps(1), although zpool status kept reporting
everything as ONLINE with no errors.  Operations on the
other pool, which does not include the new device, were
similarly unresponsive.

I was not able to kill the processes or have a clean
shutdown and had to power-cycle the system.

Looking at the logs, this time the device wasn't detached
by the controller, but smartd(8) logged some read errors
throughout the day.  These had already been showing up
well before the pool became unresponsive, though.

zpool status shows no errors and I did a successful scrub
of both pools (primary and backup) after reboot, although
the fact that zfs doesn't see the errors may also have to
do with the drive being hidden behind cgd(4).

I don't really know what to make of the errors or the fact
that zfs suddenly became unresponsive, also on the other
pool not including this device.

# uname -a
NetBSD basil 9.2_STABLE NetBSD 9.2_STABLE (GENERIC) #0: Wed Jul 14 18:05:25 UTC 2021  mkrepro%mkrepro.NetBSD.org@localhost:/usr/src/sys/arch/amd64/compile/GENERIC amd64

# cat /var/log/messages
[snip]
Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 65 to 79 
Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 72 to 73 
[more of the same]
Aug 20 11:34:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 
Aug 20 12:00:00 basil syslogd[822]: restart
Aug 20 12:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83 
[more of the same]
Aug 20 15:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 79 
Aug 20 15:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 73 to 74 
Aug 20 15:34:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 79 to 82 
Aug 20 16:00:00 basil syslogd[822]: restart
Aug 20 16:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 84 
[more of the same]
Aug 20 18:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 82 
Aug 20 22:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72 
Aug 20 22:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28 

=> this was when I first noticed IO had stopped after
transferring a little short of 1TB during the day.

Aug 21 02:34:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73 
Aug 21 02:34:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27 
Aug 21 06:15:13 basil syslogd[791]: restart
Aug 21 07:45:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84 
[more of the same]
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 79 
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71 
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 
[more of the same]
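
To get an overview of the smartd(8) messages above without
reading them one by one, something like the following could
be used.  It is only a rough sketch: the regular expression
is modelled on the "changed from X to Y" lines quoted in
this message, and the names LINE_RE, SAMPLE and summarize
are made up for illustration.  Other smartd message formats
are simply skipped.

```python
import re
from collections import defaultdict

# Pattern modelled on the smartd lines quoted in this message
# (an assumption; other smartd message formats will not match).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>\S+).*?Attribute:\s+(?P<id>\d+)\s+(?P<name>\S+)\s+"
    r"changed from\s+(?P<old>\d+)\s+to\s+(?P<new>\d+)"
)

# A few lines copied from the log excerpt above.
SAMPLE = [
    "Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 65 to 79 ",
    "Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 72 to 73 ",
    "Aug 20 12:00:00 basil syslogd[822]: restart",
    "Aug 20 15:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 79 ",
]


def summarize(lines):
    """Return {(device, attribute): (changes, lowest, highest)}.

    'lowest' and 'highest' cover both the old and the new value
    of every "changed from X to Y" message for that attribute.
    """
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not an attribute-change message
        key = (m.group("dev"), m.group("name"))
        seen[key].extend([int(m.group("old")), int(m.group("new"))])
    return {k: (len(v) // 2, min(v), max(v)) for k, v in seen.items()}


if __name__ == "__main__":
    for (dev, name), (count, low, high) in sorted(summarize(SAMPLE).items()):
        print(f"{dev} {name}: {count} change(s), range {low}..{high}")
```

To run it over the real log, the lines of /var/log/messages
can be passed in place of SAMPLE, e.g.
summarize(open("/var/log/messages")).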

