current-users: Re: HDD - SMART status

Subject: Re: HDD - SMART status
To: Jim Bernard <jbernard@mines.edu>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: current-users
Date: 04/27/2003 18:38:44
On Sat, Apr 26, 2003 at 08:45:49AM -0600, Jim Bernard wrote:
> On Sat, Apr 26, 2003 at 03:42:18PM +0200, Manuel Bouyer wrote:
> > On Sat, Apr 26, 2003 at 01:54:16AM +0200, Tomasz Luchowski wrote:
> > > Hi,
> > > in syslog:
> > > 
> > > Apr 26 01:42:48 zunpc /netbsd: pciide0:0:0: lost interrupt
> > > Apr 26 01:43:00 zunpc /netbsd:  type: ata tc_bcount: 16384 tc_skip: 0
> > > Apr 26 01:43:00 zunpc /netbsd: pciide0:0:0: bus-master DMA error: missing interrupt, status=0x21
> > > Apr 26 01:43:00 zunpc /netbsd: wd0e: DMA error reading fsbn 13069678 of 13069678-13069709 (wd0 bn 13069741; cn 12966 tn 0 sn 13), retrying
> > > Apr 26 01:43:00 zunpc /netbsd: wd0: soft error (corrected)
> > 
> > This looks more like a problem on the bus, rather than with the disk itself
> 
>   FWIW, I have an IBM 60gxp drive that develops a new bad block or two about
> every 6 months.  And the messages look somewhat similar to the messages above.
> Here's how it looked the last time:
> 
> Apr 13 00:32:52: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
> Apr 13 00:32:52: wd2: (uncorrectable data error)

No, it's not the same error; note the "uncorrectable data error" here.
This is an error reported by the drive, and the drive says it couldn't read
the data.

The error reported by Tomasz is a DMA protocol error between the drive and
the host. The drive itself didn't report an error (so maybe it could read
the data fine), but the DMA engine said it failed to transfer the data
from the drive. This is a problem on the bus.
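
To illustrate, here is a minimal sketch that decodes the status=0x21 value
from Tomasz's log. It assumes the value printed is the Bus Master IDE status
register with the usual layout (bit 0 active, bit 1 error, bit 2 interrupt,
bits 5/6 drive 0/1 DMA capable, bit 7 simplex); the names below are made up
for the example and are not taken from the driver source.

```python
# Assumed layout of the Bus Master IDE status register.
# The constant names are illustrative, not the driver's own.
BM_STATUS_BITS = [
    (0x01, "active"),              # DMA engine still running
    (0x02, "error"),               # DMA engine reported a transfer error
    (0x04, "interrupt"),           # the drive raised its interrupt
    (0x20, "drive0-dma-capable"),
    (0x40, "drive1-dma-capable"),
    (0x80, "simplex"),
]


def decode_bm_status(status):
    """Return the names of the flags set in a bus-master status byte."""
    return [name for bit, name in BM_STATUS_BITS if status & bit]


print(decode_bm_status(0x21))
```

Under that assumption, 0x21 means the engine is still marked active and the
interrupt bit never got set, which fits a transfer that did not complete on
the bus rather than an error reported by the drive.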

> ...
>[...] 
>   And it limps along at PIO-4 thereafter.
> 
>   So the driver does seem to think there are bus problems, even though the
> real source of the problem is apparently the bad block.

The driver knows this is a disk problem. However, I've seen situations
(probably related to power supply problems) where a drive had random
read errors at high DMA speeds, but worked properly in PIO mode.
I can see 2 reasons for this:
1) the drive has buggy firmware, and the bug shows up in DMA modes only
2) there is not enough power for the drive, which causes it to report media
   errors. However, downgrading to PIO modes causes the drive to draw less
   power (because it has longer idle periods between commands).

This is why I didn't prevent the downgrade even when the reported errors
are clearly problems on the disk side. In some cases it works around the
problem.
Of course the real problem here is that the IDE interface can only report
a few error conditions, so the drive can say it failed, but it can't say
why it failed.
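
As a rough sketch of how little there is to work with: the ATA error register
is a single byte, so a drive has at most eight flags to describe a failure.
The decoder below uses the classic bit assignments as I understand them; the
function and table names are hypothetical, and bit 7 is reused as an interface
CRC error in Ultra-DMA modes.

```python
# Classic ATA error register bits (one byte, eight flags in total).
# Table and function names are illustrative only.
ATA_ERROR_BITS = [
    (0x80, "BBK"),    # bad block mark (interface CRC error in Ultra-DMA)
    (0x40, "UNC"),    # uncorrectable data error
    (0x20, "MC"),     # media changed
    (0x10, "IDNF"),   # sector ID not found
    (0x08, "MCR"),    # media change requested
    (0x04, "ABRT"),   # command aborted
    (0x02, "TK0NF"),  # track 0 not found
    (0x01, "AMNF"),   # address mark not found
]


def decode_ata_error(err):
    """Return the mnemonics of the flags set in an ATA error byte."""
    return [name for bit, name in ATA_ERROR_BITS if err & bit]


print(decode_ata_error(0x40))
```

An "uncorrectable data error" like the one in Jim's log would correspond to
UNC alone; nothing in the register says whether the cause was the media, the
firmware, or the power supply.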

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 24 years of experience will always make the difference
--