port-i386: Re: Soft error on disk write corrupted drive

Subject: Re: Soft error on disk write corrupted drive
To: Stuart Brooks <stuartb@cat.co.za>
From: Brian Buhrow <buhrow@lothlorien.nfbcal.org>
List: port-i386
Date: 08/31/2007 08:09:41
	Hello.  We're running these exact same drives, modulo a revision
number, on our backup server, and have been for over a year without
incident.  We're using them in conjunction with Promise SATA controllers
driven by the pdcsata(4) driver.  In an unrelated discussion with my
co-workers this week, it was pointed out that Google wrote a paper a while
back in which they discovered, in contrast to popular belief, that drives
produced by the same manufacturer during the same period of time tend to
fail at the same time.  Consequently, it's possible that you're just seeing
this batch of drives begin to fail.
	However, it's also possible that there's some strange interaction
between your Intel controller and the drives themselves.  Would it be
possible for you to try these drives with a different controller, something
like the Promise pdc2318, which is a 4-port SATA 1.5 Gb/s interface?  If
you're running NetBSD-3.x, you'll want to get the latest 3.1 kernel if you
try this option because the 3.0 kernels contain a fairly buggy version of
the pdcsata(4) driver.  I've since patched it, the fix is included in
3.1, and the driver seems rock solid now.

Here's what our drives look like:
wd1 at atabus0 drive 0: <WDC WD5000KS-00MNB0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1(pdcsata0:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)

And the version of NetBSD we're running:

NetBSD fserv1.via.net 3.0_STABLE NetBSD 3.0_STABLE (FSERV1) #0: Wed May 17 14:21:28 PDT 2006  buhrow@lothlorien.nfbcal.org:/usr/src/sys/arch/i386/compile/FSERV1 i386

If you try this, you want /usr/src/sys/dev/pci/pdcsata.c V1.3.2.3 or later.

-Brian
On Aug 31,  9:49am, Stuart Brooks wrote:
} Subject: Re: Soft error on disk write corrupted drive
} Manuel Bouyer wrote:
} > On Thu, Aug 30, 2007 at 08:53:59PM +0100, David Laight wrote:
} >   
} >> On Thu, Aug 30, 2007 at 09:27:43PM +0200, Manuel Bouyer wrote:
} >>     
} >>>> Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
} >>>> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
} >>>> Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
} >>>> Aug 18 14:56:01 Connswater1 /netbsd: wd0: soft error (corrected)
} >>>>         
} >>> Hum, 268435451 = 0xffffffb. This looks like LBA48 lossage.
} >>> Maybe this drive doesn't handle properly LBA48 PIO transfers.
} >>>       
} >> Is this a case where we are doing LBA28 transfers of multiple sectors
} >> that cross the boundary ?
} >>     
} >
} > I suspect it is, yes. But the controller may be at fault too here.
} >
} >   
} Thanks for all the posts. Some more information has come to light which 
} may be of interest. I have just experienced exactly the same problem on 
} another disk and the logs indicate an error within 12 sectors of the 
} original error:
} 
} Aug 20 20:56:00 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
} Aug 20 20:56:00 Connswater2 /netbsd: wd0: (id not found)
} Aug 20 20:56:00 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
} Aug 20 20:56:00 Connswater2 /netbsd: wd0: (id not found)
} Aug 20 20:56:01 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
} Aug 20 20:56:01 Connswater2 /netbsd: wd0: (id not found)
} Aug 20 20:56:01 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
} Aug 20 20:56:01 Connswater2 /netbsd: wd0: (id not found)
} Aug 20 20:56:02 Connswater2 /netbsd: wd0: soft error (corrected)
} 
} The two disks are the same models and I believe they came from the same batch. They both have identical disklabels on them. The transfers to the disks are in large blocks of 100s of kilobytes. 
} 
} I believe the motherboard is the DFI G7V600-B and the controller information from /var/log/messages is as follows:
} 
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0 at pci0 dev 31 function 1
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: Intel 82801FB IDE Controller (ICH6) (rev. 0x04)
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: bus-master DMA support present
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: primary channel configured to compatibility mode
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: primary channel interrupting at irq 14
} Aug 17 10:54:23 Connswater2 /netbsd: atabus1 at piixide0 channel 0
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: secondary channel configured to compatibility mode
} Aug 17 10:54:23 Connswater2 /netbsd: piixide0: secondary channel ignored (disabled)
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1 at pci0 dev 31 function 2
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1: Intel 82801FB Serial ATA/Raid Controller (rev. 0x04)
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1: bus-master DMA support present
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1: primary channel configured to native-PCI mode
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1: using irq 15 for native-PCI interrupt
} Aug 17 10:54:23 Connswater2 /netbsd: atabus0 at piixide1 channel 0
} Aug 17 10:54:23 Connswater2 /netbsd: piixide1: secondary channel configured to native-PCI mode
} Aug 17 10:54:23 Connswater2 /netbsd: atabus2 at piixide1 channel 1
} 
} And the drive info is:
} 
} Aug 17 10:54:23 Connswater2 /netbsd: wd0 at atabus0 drive 0: <WDC WD5000AAJS-22TKA0>
} Aug 17 10:54:23 Connswater2 /netbsd: wd0: drive supports 16-sector PIO transfers, LBA48 addressing
} Aug 17 10:54:23 Connswater2 /netbsd: wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
} Aug 17 10:54:23 Connswater2 /netbsd: wd0: 32-bit data port
} Aug 17 10:54:23 Connswater2 /netbsd: wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
} 
} Does anyone have any suggestions about what I can do aside from not using these drives? Is it likely that I'd pick up this problem with other large drives?
} 
} Thanks a lot,
}  Stuart
} 
} 
} 
>-- End of excerpt from Stuart Brooks
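
The LBA28 boundary suspicion raised in the quoted thread can be checked against the numbers in the two log excerpts. The following is a minimal Python sketch of that arithmetic only. The function names are made up for illustration and are not taken from the NetBSD source, and the sketch says nothing about how wd(4) actually chooses between LBA28 and LBA48 commands. It assumes the geometry reported in the dmesg output (16 heads, 63 sectors per track) and that the "sn" field in the log is 0-based, which is what the logged values imply.

```python
# Highest sector number addressable with a 28-bit LBA: 0xfffffff = 268435455.
LBA28_MAX = (1 << 28) - 1


def crosses_lba28(start_bn, nsectors):
    """Return True if a transfer starts inside the LBA28 range but ends past it."""
    last_bn = start_bn + nsectors - 1
    return start_bn <= LBA28_MAX < last_bn


def bn_to_chs(bn, heads=16, sectors=63):
    """Split an absolute block number into (cylinder, head, sector).

    The sector is 0-based here, an assumption that matches the
    "cn/tn/sn" values printed in the quoted logs.
    """
    cyl, rem = divmod(bn, heads * sectors)
    head, sec = divmod(rem, sectors)
    return cyl, head, sec


# (host, wd0 bn, first fsbn, last fsbn), copied from the two log excerpts.
reports = [
    ("Connswater1", 268435451, 216369084, 216369211),
    ("Connswater2", 268435439, 216369072, 216369199),
]

for host, bn, first_fsbn, last_fsbn in reports:
    nsectors = last_fsbn - first_fsbn + 1
    print(host, hex(bn), nsectors, crosses_lba28(bn, nsectors), bn_to_chs(bn))
```

Both failing writes are 128-sector transfers that begin a few sectors below 0xfffffff and end above it, so each one straddles the LBA28 limit, and the computed cylinder/head/sector values agree with the cn/tn/sn fields in the logs.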