current-users: Re: HDD - SMART status

Subject: Re: HDD - SMART status
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Jim Bernard <jbernard@mines.edu>
List: current-users
Date: 04/26/2003 08:45:49
On Sat, Apr 26, 2003 at 03:42:18PM +0200, Manuel Bouyer wrote:
> On Sat, Apr 26, 2003 at 01:54:16AM +0200, Tomasz Luchowski wrote:
> > Hi,
> > in syslog:
> > 
> > Apr 26 01:42:48 zunpc /netbsd: pciide0:0:0: lost interrupt
> > Apr 26 01:43:00 zunpc /netbsd:  type: ata tc_bcount: 16384 tc_skip: 0
> > Apr 26 01:43:00 zunpc /netbsd: pciide0:0:0: bus-master DMA error: missing interrupt, status=0x21
> > Apr 26 01:43:00 zunpc /netbsd: wd0e: DMA error reading fsbn 13069678 of 13069678-13069709 (wd0 bn 13069741; cn 12966 tn 0 sn 13), retrying
> > Apr 26 01:43:00 zunpc /netbsd: wd0: soft error (corrected)
> 
> This looks more like a problem on the bus, rather than with the disk itself

  FWIW, I have an IBM 60gxp drive that develops a new bad block or two about
every 6 months.  And the messages look somewhat similar to the messages above.
Here's how it looked the last time:

Apr 13 00:32:52: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
Apr 13 00:32:52: wd2: (uncorrectable data error)
...
Apr 13 00:33:08: wd2: transfer error, downgrading to Ultra-DMA mode 2
Apr 13 00:33:08: wd2(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA data transfers)
Apr 13 00:33:08: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
Apr 13 00:33:08: wd2: (uncorrectable data error)
...
Apr 13 00:33:30: pciide1:0:0: lost interrupt
Apr 13 00:33:30:    type: ata tc_bcount: 8192 tc_skip: 0
Apr 13 00:33:30: pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Apr 13 00:33:30: wd2: transfer error, downgrading to Ultra-DMA mode 1
Apr 13 00:33:30: wd2(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 1 (using DMA data transfers)
Apr 13 00:33:30: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
Apr 13 00:33:30: wd2: (uncorrectable data error)
Apr 13 00:33:35: wd2: transfer error, downgrading to DMA mode 2
Apr 13 00:33:35: wd2(pciide1:0:0): using PIO mode 4, DMA mode 2 (using DMA data transfers)
Apr 13 00:33:35: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
Apr 13 00:33:35: wd2: (uncorrectable data error)
Apr 13 00:33:46: pciide1:0:0: lost interrupt
Apr 13 00:33:46:    type: ata tc_bcount: 8192 tc_skip: 0
Apr 13 00:33:46: pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
Apr 13 00:33:46: wd2: transfer error, downgrading to PIO mode 4
Apr 13 00:33:46: wd2(pciide1:0:0): using PIO mode 4
Apr 13 00:33:46: wd2e: error reading fsbn 25583984 of 25583984-0 (wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying
Apr 13 00:33:46: wd2: (uncorrectable data error)

  And it limps along at PIO-4 thereafter.
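
  To make the fallback sequence in that log easier to see, here is a minimal
Python sketch that pulls out the "downgrading to ..." steps.  The regular
expression and the name downgrade_steps are assumptions based only on the
messages quoted above; they are not part of the wd driver or any NetBSD tool.

```python
import re

# Pattern assumed from the messages quoted above; other kernel versions
# may word the message differently.
DOWNGRADE_RE = re.compile(r"(wd\d+): transfer error, downgrading to (.+)$")

def downgrade_steps(lines):
    """Return (drive, mode) pairs in the order the downgrades were logged."""
    steps = []
    for line in lines:
        m = DOWNGRADE_RE.search(line.strip())
        if m:
            steps.append((m.group(1), m.group(2)))
    return steps

# Sample input copied from the log excerpt above.
LOG = """\
Apr 13 00:33:08: wd2: transfer error, downgrading to Ultra-DMA mode 2
Apr 13 00:33:30: wd2: transfer error, downgrading to Ultra-DMA mode 1
Apr 13 00:33:35: wd2: transfer error, downgrading to DMA mode 2
Apr 13 00:33:46: wd2: transfer error, downgrading to PIO mode 4
""".splitlines()

for drive, mode in downgrade_steps(LOG):
    print(drive, "->", mode)
```

  On the excerpt above this lists four steps for wd2, ending at PIO mode 4.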

  So the driver does seem to think there are bus problems, even though the
real source of the problem is apparently the bad block.  Each time this has
happened (4 times so far), I've run the IBM "disk fitness test" utility to
reformat the drive (thus mapping out the bad block(s)), and it has then
worked perfectly for several months at UDMA-5, until the next bad block
appeared.
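
  One rough way to check the "bad block, not bus" reading is to count how
often each absolute block number shows up in the read errors: if every retry
fails at the same block, a media defect looks more likely than a flaky cable.
The following is a minimal Python sketch under that assumption.  The regular
expression and the name failing_blocks are hypothetical, written against the
message format quoted above, and the result is only suggestive, not a
diagnosis.

```python
import re
from collections import Counter

# Pattern assumed from the "error reading fsbn ..." lines quoted above.
# It captures the drive name and the absolute block number ("wdN bn ...").
READ_ERR_RE = re.compile(
    r"(wd\d+)[a-p]: (?:DMA )?error reading fsbn \d+ of \S+ \(wd\d+ bn (\d+);"
)

def failing_blocks(lines):
    """Count read errors per (drive, absolute block number)."""
    counts = Counter()
    for line in lines:
        m = READ_ERR_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts

# Sample input copied from the log excerpt above.
SAMPLE = [
    "Apr 13 00:32:52: wd2e: error reading fsbn 25583984 of 25583984-0 "
    "(wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying",
    "Apr 13 00:33:08: wd2e: error reading fsbn 25583984 of 25583984-0 "
    "(wd2 bn 26633312; cn 26421 tn 14 sn 62), retrying",
    "Apr 13 00:33:30: pciide1:0:0: lost interrupt",
]

for (drive, block), n in failing_blocks(SAMPLE).items():
    print(f"{drive} block {block}: {n} read error(s)")
```

  Errors scattered over many unrelated blocks would point back toward the
cable or controller; a single repeated block number, as in my log, fits the
bad-block explanation.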

  In case there's any lingering doubt: This drive has resided on two different
controllers and had two or three different cables attached (none of which has
given any sign of problems on any other drive), and it has exhibited this
behavior since about a year after I bought it.  The other two drives in the
system have had no problems.  So, I don't think it's a cable/bus problem (in
my case), but just a flaky drive.

--Jim