port-i386: Soft error on disk write corrupted drive

Subject: Soft error on disk write corrupted drive
To: None <port-i386@NetBSD.org>
From: Stuart Brooks <stuartb@cat.co.za>
List: port-i386
Date: 08/30/2007 10:36:35

Hi,

I have picked up a very concerning problem on NetBSD 3.1_RC2 involving a 
corrected soft error following an "error writing fsbn".

The short version:
A disk write directed to the rwd0g partition reported "error writing 
fsbn" with "id not found" a few times before succeeding (we believed) 
with "soft error (corrected)". However, the write actually landed at 
sector 0 of *wd0d*, trashing the disk. The data never made it onto the 
wd0g partition.

The longer version:

The g partition is used as a raw file system and I write structures 
sequentially into it. Every structure contains a magic number, timestamp 
and offset which can be used to check the validity. The following error 
was seen in the logs at the time when the problem occurred:

Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:01 Connswater1 /netbsd: wd0: soft error (corrected)
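
The numbers in these messages can be cross-checked against the disklabel given further down. The following is a minimal Python sketch, not anything taken from the kernel: the function names are made up for illustration, and it assumes the usual mapping of "whole-disk block = fsbn + partition offset" together with the 63 sectors/track, 16 tracks/cylinder geometry from the label.

```python
# Values copied from the log line and from the "before" disklabel in this mail.
FSBN = 216369084            # first block of the failed transfer, relative to wd0g
LAST_FSBN = 216369211       # last block of the failed transfer
G_OFFSET = 52066367         # start of partition g on wd0, in sectors
SECTORS_PER_TRACK = 63
TRACKS_PER_CYLINDER = 16


def absolute_block(fsbn, partition_offset):
    """Translate a partition-relative block number to a whole-disk one."""
    return fsbn + partition_offset


def to_chs(bn, sectors_per_track, tracks_per_cylinder):
    """Split a whole-disk block number into (cylinder, track, sector)."""
    sectors_per_cylinder = sectors_per_track * tracks_per_cylinder
    cn, rest = divmod(bn, sectors_per_cylinder)
    tn, sn = divmod(rest, sectors_per_track)
    return cn, tn, sn


bn = absolute_block(FSBN, G_OFFSET)
chs = to_chs(bn, SECTORS_PER_TRACK, TRACKS_PER_CYLINDER)
transfer_sectors = LAST_FSBN - FSBN + 1
below_2_28 = 2**28 - bn     # distance from the start block up to 2^28

print(bn, chs, transfer_sectors, below_2_28)
```

This gives bn 268435451 and (266305, 0, 11), which agrees with the "wd0 bn", "cn", "tn" and "sn" values in the log, and a transfer length of 128 sectors (64 KiB). It also shows that the transfer starts 5 sectors below 2^28 (268435456), so the 128-sector write spans that value. That is only an observation about the numbers, not a diagnosis.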

We managed to recover the "a" partition of the disk in order to get the logs.

After the corruption had taken place we reviewed a dump of the first few megabytes of the disk (wd0d). A hexdump revealed that the structures destined for the g partition had been written over the initial sectors of the drive.

The timestamp of the structures matched the time of the error exactly, and the offset in the structure I wrote was 20 sectors beyond the point of the error (as reported in the error message). The writes are normally large blocks of a few hundred kilobytes. The data was never written into the wd0g partition: a dump of it was checked and showed 60kB of zero-filled data amongst the normal data.
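
The kind of check described above, finding structures in a raw dump and comparing where they sit with where they claim to belong, can be sketched roughly as follows. The header layout here is purely hypothetical (a 32-bit magic, a 32-bit timestamp and a 64-bit offset in sectors, little-endian), as are the magic value and the function names. The real on-disk format is not given in this mail.

```python
import struct

SECTOR_SIZE = 512
# Hypothetical header layout: magic, timestamp, offset (in sectors
# from the start of the partition), all little-endian.
HEADER = struct.Struct("<IIQ")
MAGIC = 0x52454331          # hypothetical magic value


def find_records(image, magic=MAGIC):
    """Yield (sector, timestamp, claimed_offset) for every header that
    starts on a sector boundary of the raw dump."""
    for pos in range(0, len(image) - HEADER.size + 1, SECTOR_SIZE):
        found_magic, timestamp, offset = HEADER.unpack_from(image, pos)
        if found_magic == magic:
            yield pos // SECTOR_SIZE, timestamp, offset


def misplaced_records(image, partition_offset, dump_start=0, magic=MAGIC):
    """Return (actual_sector, expected_sector, timestamp) for records
    found somewhere other than where their own offset field says.
    dump_start is the whole-disk sector of the first byte of the dump."""
    bad = []
    for sector, timestamp, offset in find_records(image, magic):
        actual = dump_start + sector
        expected = partition_offset + offset
        if actual != expected:
            bad.append((actual, expected, timestamp))
    return bad


# Synthetic 4-sector "dump of wd0d" with one record at sector 0 whose
# offset field points 20 sectors past the failing fsbn. The timestamp
# is just an example value.
image = bytearray(4 * SECTOR_SIZE)
HEADER.pack_into(image, 0, MAGIC, 1187448959, 216369084 + 20)
result = misplaced_records(bytes(image), partition_offset=52066367)
print(result)
```

On the synthetic image this reports one record sitting at sector 0 that claims to belong at whole-disk sector 268435471. Run against a real dump with the real header format, the same comparison would list every structure that landed in the wrong place.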


I have added some additional information to this e-mail:

1. atactl of drive
2. disklabel before the problem
3. disklabel after the problem (corrupted)

Does anyone have any idea how this could have happened? Ideas I've had are:

1. A problem with the rewrite attempt in NetBSD
2. A corruption on the PCI transfer
3. An error on the drive
    - an incorrect sector write
    - a failed reallocation

Any help would be much appreciated...

Thanks
 Stuart



>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Extra info:

 
An 'atactl identify' of the drive is as follows:

Model: WDC WD5000AAJS-22TKA0, Rev: 12.01C01, Serial #:      WD-WCAPW3229413
Device type: ATA, fixed
Device supports command queue depth of 15
Device capabilities:
        DMA
        LBA
        ATA standby timer values
        IORDY operation
        IORDY disabling
Device supports following standards:
ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7
Command set support:
        NOP command (enabled)
        READ BUFFER command (enabled)
        WRITE BUFFER command (enabled)
        Host Protected Area feature set (enabled)
        look-ahead (enabled)
        write cache (enabled)
        Power Management feature set (enabled)
        Security Mode feature set (disabled)
        SMART feature set (enabled)
        FLUSH CACHE EXT command (enabled)
        FLUSH CACHE command (enabled)
        Device Configuration Overlay feature set (enabled)
        48-bit Address feature set (enabled)
        Automatic Acoustic Management feature set (disabled)
        SET MAX security extension (disabled)
        SET FEATURES required to spin-up after power-up (enabled)
        Power-Up In Standby feature set (disabled)
        DOWNLOAD MICROCODE command (enabled)
        World Wide name
        General Purpose Logging feature set
        SMART self-test
        SMART error logging

The disklabel of the drive beforehand:

# /dev/rwd1d:
type: ESDI
disk: Autowd0
label: 200708280820040f
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 969020
total sectors: 976772160
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

7 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:   3072000        63     4.2BSD   1024  8192 46552  # (Cyl.      0*-   3047*)
 b:   1228800   3072063       swap                     # (Cyl.   3047*-   4266*)
 c: 976773105        63     unused      0     0        # (Cyl.      0*- 969020+)
 e:  10240000   4300863     4.2BSD   1024  8192 46552  # (Cyl.   4266*-  14425*)
 f:  37525504  14540863     4.2BSD   1024  8192 55192  # (Cyl.  14425*-  51653*)
 g: 922779648  52066367     4.2BSD   1024  8192 55200  # (Cyl.  51653*- 967109*)
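
As a consistency check on the label above, the sizes and offsets can be verified to be contiguous and within the disk. This is a small illustrative Python sketch with a made-up helper name. The c partition is left out on purpose because it spans the others.

```python
# (name, size, offset) in sectors, copied from the label above.
PARTITIONS = [
    ("a", 3072000, 63),
    ("b", 1228800, 3072063),
    ("e", 10240000, 4300863),
    ("f", 37525504, 14540863),
    ("g", 922779648, 52066367),
]
TOTAL_SECTORS = 976772160


def check_layout(partitions, total_sectors):
    """Return a list of problems: gaps, overlaps, or partitions that
    run past the end of the disk. An empty list means none were found."""
    problems = []
    ordered = sorted(partitions, key=lambda p: p[2])
    for (name, size, offset), nxt in zip(ordered, ordered[1:] + [None]):
        end = offset + size
        if end > total_sectors:
            problems.append(f"{name} ends at {end}, past {total_sectors}")
        if nxt is not None and end != nxt[2]:
            kind = "overlaps" if end > nxt[2] else "leaves a gap before"
            problems.append(f"{name} {kind} {nxt[0]}")
    return problems


problems = check_layout(PARTITIONS, TOTAL_SECTORS)
g_start = PARTITIONS[-1][2]
g_end = g_start + PARTITIONS[-1][1]
failing_bn_in_g = g_start <= 268435451 < g_end   # wd0 bn from the log

print(problems, g_end, failing_bn_in_g)
```

No problems are reported for partitions a, b, e, f and g. They follow one another without gaps, g ends at sector 974846015, and the failing block 268435451 from the log lies inside g as expected.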


And the disklabel after the corruption was as follows:

type: ESDI
disk: WDC WD5000AAJS-2
label: fictitious
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 969021
total sectors: 976773168
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

4 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a: 976773168         0     4.2BSD      0     0     0  # (Cyl.      0 - 969020)
 d: 976773168         0     unused      0     0        # (Cyl.      0 - 969020)