Subject: Re: wd interface CRC errors
To: David Maxwell <david@vex.net>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: netbsd-help
Date: 10/21/2006 17:57:14
On Fri, Oct 20, 2006 at 12:27:51PM -0400, David Maxwell wrote:
> 
> I have a NetBSD 2.0.2, i386, with an uptime of 238 days. The install
> is about a year and a half old, running 24/7 since May 2005.
> 
> It has three WDs, and wd1/wd2 are a mirror. Nothing has changed
> physically in the system, so the usual 'cable problem' suggestion 
> doesn't seem to apply.

Well, it can come from other sources too. Greg suggested the power supply,
and I have seen this once. A PSU's performance can degrade over time,
I assume because of aging electrolytic capacitors. It's also possible that
drives need more power as they get older (more friction in the mechanics).
CRC errors can also be caused by changes in the electromagnetic
environment of the box. Connectors can also cause issues after some time
(I got this once with SCSI devices: a box which had run fine for years
started to show SCSI communication issues after the air cooling in the
room failed for a few hours. I couldn't get it stable again and had to
replace the SCSI cable).
So I would still try replacing the cables first; if that doesn't help,
try a stronger power supply.

> 
> These started in September, and have become common:
> 
> Sep 19 08:58:28 mail /netbsd: wd0a: error writing fsbn 1114304 of 1114304-1114319 (wd0 bn 1114367; cn 1105 tn 8 sn 23), retrying
> Sep 19 08:58:29 mail /netbsd: wd0: (aborted command, interface CRC error)
> Sep 19 08:58:29 mail /netbsd: wd0: soft error (corrected)
> 
> They show up on wd0 and wd1, which share a controller, and a cable. 
> All errors are corrected so far. (18 on wd0, 19 on wd1) 
> 
> All of the errors occur while writing, and there's no locality 
> amongst the sectors involved in the errored writes.
> 
> The smart status shows a high count on wd0 for raw read error rate and
> hardware ECC recovered errors, so I'm inclined to replace that drive.

This could also be a sign of a power supply problem.
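
To see whether the errors keep following the shared cable and controller
rather than a single drive, it may help to tally them from the logs. Below is
a minimal sketch in Python, assuming the kernel messages keep exactly the
format quoted above; the `summarize` function, the regular expressions, and
the `SAMPLE` text are illustrative and not part of any NetBSD tool.

```python
import re
from collections import Counter

# Assumed format, taken from the log excerpt quoted in this thread:
#   ... /netbsd: wd0: (aborted command, interface CRC error)
CRC_RE = re.compile(
    r"/netbsd: (wd\d+): \(aborted command, interface CRC error\)"
)
# Assumed format of the preceding I/O error line; captures the drive,
# the direction (reading/writing) and the absolute block number ("bn").
IO_RE = re.compile(
    r"/netbsd: (wd\d+)[a-p]: error (reading|writing) fsbn \d+.*\(wd\d+ bn (\d+);"
)

SAMPLE = """\
Sep 19 08:58:28 mail /netbsd: wd0a: error writing fsbn 1114304 of 1114304-1114319 (wd0 bn 1114367; cn 1105 tn 8 sn 23), retrying
Sep 19 08:58:29 mail /netbsd: wd0: (aborted command, interface CRC error)
Sep 19 08:58:29 mail /netbsd: wd0: soft error (corrected)
"""

def summarize(lines):
    """Return (CRC errors per drive, I/O errors per (drive, direction),
    failing block numbers per drive)."""
    crc = Counter()
    ops = Counter()
    blocks = {}
    for line in lines:
        m = CRC_RE.search(line)
        if m:
            crc[m.group(1)] += 1
            continue
        m = IO_RE.search(line)
        if m:
            drive, direction, bn = m.group(1), m.group(2), int(m.group(3))
            ops[(drive, direction)] += 1
            blocks.setdefault(drive, []).append(bn)
    return crc, ops, blocks

if __name__ == "__main__":
    crc, ops, blocks = summarize(SAMPLE.splitlines())
    print("CRC errors per drive:", dict(crc))
    print("I/O errors per (drive, direction):", dict(ops))
    print("Failing blocks per drive:", blocks)
```

Feeding it the real log lines instead of `SAMPLE` would show whether the
counts stay roughly equal on both drives of the shared cable, and whether the
block numbers remain scattered. Both observations point at the cable,
controller or power rather than at one disk's surface.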

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 years of experience will always make the difference
--