Subject: Re: Soft error on disk write corrupted drive
To: Manuel Bouyer <bouyer@antioche.eu.org>
From: Stuart Brooks <stuartb@cat.co.za>
List: port-i386
Date: 08/31/2007 09:49:05
Manuel Bouyer wrote:
> On Thu, Aug 30, 2007 at 08:53:59PM +0100, David Laight wrote:
>   
>> On Thu, Aug 30, 2007 at 09:27:43PM +0200, Manuel Bouyer wrote:
>>     
>>>> Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
>>>> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
>>>> Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
>>>> Aug 18 14:56:01 Connswater1 /netbsd: wd0: soft error (corrected)
>>>>         
>>> Hum, 268435451 = 0xffffffb. This looks like LBA48 lossage.
>>> Maybe this drive doesn't handle properly LBA48 PIO transfers.
>>>       
>> Is this a case where we are doing LBA28 transfers of multiple sectors
>> that cross the boundary ?
>>     
>
> I suspect it is, yes. But the controller may be at fault too here.
>
>   
Thanks for all the posts. Some more information has come to light which 
may be of interest. I have just experienced exactly the same problem on 
another disk and the logs indicate an error within 12 sectors of the 
original error:

Aug 20 20:56:00 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
Aug 20 20:56:00 Connswater2 /netbsd: wd0: (id not found)
Aug 20 20:56:00 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
Aug 20 20:56:00 Connswater2 /netbsd: wd0: (id not found)
Aug 20 20:56:01 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
Aug 20 20:56:01 Connswater2 /netbsd: wd0: (id not found)
Aug 20 20:56:01 Connswater2 /netbsd: wd0g: error writing fsbn 216369072 of 216369072-216369199 (wd0 bn 268435439; cn 266304 tn 15 sn 62), retrying
Aug 20 20:56:01 Connswater2 /netbsd: wd0: (id not found)
Aug 20 20:56:02 Connswater2 /netbsd: wd0: soft error (corrected)

The two disks are the same model and I believe they came from the same batch. They both have identical disklabels on them. The transfers to the disks are in large blocks of hundreds of kilobytes.
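
As a rough check of the LBA28 boundary theory raised earlier in the thread, the numbers from both logs can be run through a few lines of Python. This is only a sketch: it assumes the LBA28 limit is 2**28 sectors (highest address 0x0fffffff) and that the fsbn range in each log line is the full extent of the failing transfer. The function name crosses_lba28 is made up for illustration and is not anything from the kernel.

```python
# Sketch: does a transfer straddle the LBA28 addressing limit?
# Assumption: LBA28 can address sectors 0 .. 2**28 - 1 (0x0fffffff).
LBA28_SECTORS = 1 << 28  # 268435456

def crosses_lba28(start_bn, nsectors):
    """True if the transfer starts below the limit and ends at or past it."""
    end_bn = start_bn + nsectors - 1  # last sector touched
    return start_bn < LBA28_SECTORS <= end_bn

# Values copied from the two log excerpts:
# (wd0 bn of the first sector, length of the fsbn range).
transfers = {
    "Connswater1": (268435451, 216369211 - 216369084 + 1),
    "Connswater2": (268435439, 216369199 - 216369072 + 1),
}

for host, (start, count) in transfers.items():
    print(host, hex(start), count, crosses_lba28(start, count))
```

Under those assumptions both transfers are 128 sectors long, start just below the limit and end past it, which is at least consistent with the failure being tied to the boundary rather than to a bad spot on either disk.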

I believe the motherboard is the DFI G7V600-B and the controller information from /var/log/messages is as follows:

Aug 17 10:54:23 Connswater2 /netbsd: piixide0 at pci0 dev 31 function 1
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: Intel 82801FB IDE Controller (ICH6) (rev. 0x04)
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: bus-master DMA support present
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: primary channel configured to compatibility mode
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: primary channel interrupting at irq 14
Aug 17 10:54:23 Connswater2 /netbsd: atabus1 at piixide0 channel 0
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: secondary channel configured to compatibility mode
Aug 17 10:54:23 Connswater2 /netbsd: piixide0: secondary channel ignored (disabled)
Aug 17 10:54:23 Connswater2 /netbsd: piixide1 at pci0 dev 31 function 2
Aug 17 10:54:23 Connswater2 /netbsd: piixide1: Intel 82801FB Serial ATA/Raid Controller (rev. 0x04)
Aug 17 10:54:23 Connswater2 /netbsd: piixide1: bus-master DMA support present
Aug 17 10:54:23 Connswater2 /netbsd: piixide1: primary channel configured to native-PCI mode
Aug 17 10:54:23 Connswater2 /netbsd: piixide1: using irq 15 for native-PCI interrupt
Aug 17 10:54:23 Connswater2 /netbsd: atabus0 at piixide1 channel 0
Aug 17 10:54:23 Connswater2 /netbsd: piixide1: secondary channel configured to native-PCI mode
Aug 17 10:54:23 Connswater2 /netbsd: atabus2 at piixide1 channel 1

And the drive info is:

Aug 17 10:54:23 Connswater2 /netbsd: wd0 at atabus0 drive 0: <WDC WD5000AAJS-22TKA0>
Aug 17 10:54:23 Connswater2 /netbsd: wd0: drive supports 16-sector PIO transfers, LBA48 addressing
Aug 17 10:54:23 Connswater2 /netbsd: wd0: 465 GB, 969021 cyl, 16 head, 63 sec, 512 bytes/sect x 976773168 sectors
Aug 17 10:54:23 Connswater2 /netbsd: wd0: 32-bit data port
Aug 17 10:54:23 Connswater2 /netbsd: wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
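
The cn/tn/sn values in the error messages can be cross-checked against this geometry. The sketch below assumes the logged sector number is zero-based, which is what makes the figures agree. chs_to_bn is a hypothetical helper written for this check, not a kernel function.

```python
# Sketch: convert the cn/tn/sn triple from the kernel log back to a block number.
# Assumption: sn is zero-based in these log lines.
HEADS, SECTORS_PER_TRACK = 16, 63  # from the wd0 attach line

def chs_to_bn(cn, tn, sn):
    return (cn * HEADS + tn) * SECTORS_PER_TRACK + sn

print(chs_to_bn(266305, 0, 11))    # compare with logged wd0 bn 268435451
print(chs_to_bn(266304, 15, 62))   # compare with logged wd0 bn 268435439
print(969021 * HEADS * SECTORS_PER_TRACK)  # compare with 976773168 sectors
```

All three results match the values the kernel printed, so the block numbers in the error lines appear to be consistent with the reported geometry.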

Does anyone have any suggestions about what I can do aside from not using these drives? Is it likely that I'd pick up this problem with other large drives?

Thanks a lot,
 Stuart