Subject: Re: Soft error on disk write corrupted drive
To: Stuart Brooks <stuartb@cat.co.za>
From: Manuel Bouyer <bouyer@antioche.eu.org>
List: port-i386
Date: 08/30/2007 21:27:43
On Thu, Aug 30, 2007 at 10:36:35AM +0200, Stuart Brooks wrote:
> Hi,
> 
> I have picked up a very concerning problem on NetBSD 3.1_RC2 involving a 
> corrected soft error following an "error writing fsbn".
> 
> The short version:
> A disk write which was directed to the rwd0g partition reported the 
> "error writing fsbn" with "id not found" a few times before succeeding 
> (we believed) with "soft error (corrected)". However the write actually 
> ended up taking place to sector 0 on *wd0d*, trashing the disk. The data 
> never made its way onto the wd0g partition.
> 
> The longer version:
> 
> The g partition is used as a raw file system and I write structures 
> sequentially into it. Every structure contains a magic number, timestamp 
> and offset which can be used to check the validity. The following error 
> was seen in the logs at the time when the problem occurred:
> 
> Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
> Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
> Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
> Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
> Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
> Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
> Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 
> 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
> Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
> Aug 18 14:56:01 Connswater1 /netbsd: wd0: soft error (corrected)

Hmm, 268435451 = 0xffffffb, just below the 28-bit LBA limit of 0x10000000.
This looks like LBA48 lossage.
Maybe this drive doesn't properly handle LBA48 PIO transfers.
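
To spell out the arithmetic, here is a small Python sketch. The helper name
crosses_lba28 is made up for illustration and is not anything from the wd
driver. It takes the block numbers from the log above and checks where the
transfer sits relative to the 28-bit LBA limit. That the driver would issue
this particular write as an LBA48 command is an assumption, not something
the log itself shows.

```python
# 28-bit LBA can address sectors 0 .. 0xfffffff, i.e. 2**28 sectors in total.
LBA28_LIMIT = 1 << 28


def crosses_lba28(start, count):
    """Return True if a transfer starts below the LBA28 limit
    but its last sector lies at or beyond it."""
    last = start + count - 1
    return start < LBA28_LIMIT <= last


# Values taken from the quoted log lines.
start = 268435451                    # "wd0 bn 268435451"
count = 216369211 - 216369084 + 1    # "fsbn 216369084 of 216369084-216369211"

print(hex(start))                    # starting sector in hex
print(count)                         # number of sectors in the transfer
print(LBA28_LIMIT - start)           # sectors left below the LBA28 limit
print(crosses_lba28(start, count))   # does the transfer straddle the limit?
```

So the 128-sector write starts 5 sectors below the limit and runs past it,
which is why LBA48 addressing comes into the picture for this request.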

What kind of controller is it?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference