Subject: Soft error on disk write corrupted drive
To: None <port-i386@NetBSD.org>
From: Stuart Brooks <stuartb@cat.co.za>
List: port-i386
Date: 08/30/2007 10:36:35
Hi,
I have run into a very concerning problem on NetBSD 3.1_RC2 involving a
corrected soft error that followed an "error writing fsbn".
The short version:
A disk write directed to the rwd0g partition reported "error writing
fsbn" with "id not found" a few times before apparently succeeding with
"soft error (corrected)". However, the data was actually written
starting at sector 0 of *wd0d*, trashing the disk. It never made its way
onto the wd0g partition.
The longer version:
The g partition is used as a raw file system and I write structures
sequentially into it. Every structure contains a magic number, a
timestamp and an offset, which can be used to check its validity. The
following errors were seen in the logs at the time the problem occurred:
Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:55:59 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:55:59 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:00 Connswater1 /netbsd: wd0g: error writing fsbn 216369084 of 216369084-216369211 (wd0 bn 268435451; cn 266305 tn 0 sn 11), retrying
Aug 18 14:56:00 Connswater1 /netbsd: wd0: (id not found)
Aug 18 14:56:01 Connswater1 /netbsd: wd0: soft error (corrected)
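As a cross-check of the numbers in these log lines, here is a small Python sketch. The constants come from the log above and from the pre-corruption disklabel included later in this mail; the helper names are made up for illustration. The arithmetic shows that the 128-sector transfer starts 5 sectors below 2^28 (the number of sectors addressable with 28-bit LBA) and ends above it. Whether that is related to the misdirected write is not established here.

```python
# Cross-check of the logged block numbers (values copied from the log and disklabel).
G_OFFSET = 52066367        # start of partition g on wd0, in sectors
G_SIZE = 922779648         # size of partition g, in sectors
FSBN_FIRST = 216369084     # first partition-relative block of the failed transfer
FSBN_LAST = 216369211      # last partition-relative block of the failed transfer
SECTORS_PER_TRACK = 63
TRACKS_PER_CYL = 16
LBA28_LIMIT = 1 << 28      # sectors addressable with a 28-bit LBA


def to_absolute(fsbn, part_offset=G_OFFSET):
    """Convert a partition-relative block number to a whole-disk block number."""
    return part_offset + fsbn


def to_chs(bn):
    """Split a whole-disk block number into (cylinder, track, sector)."""
    cn, rem = divmod(bn, SECTORS_PER_TRACK * TRACKS_PER_CYL)
    tn, sn = divmod(rem, SECTORS_PER_TRACK)
    return cn, tn, sn


first = to_absolute(FSBN_FIRST)   # 268435451, the "wd0 bn" in the log
last = to_absolute(FSBN_LAST)     # 268435578
count = last - first + 1          # 128 sectors, i.e. 64 kB

print("wd0 bn:", first, "chs:", to_chs(first))
print("sectors in transfer:", count)
print("inside partition g:", FSBN_LAST < G_SIZE)
print("straddles 2^28:", first < LBA28_LIMIT <= last)
```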
We managed to recover the a partition of the disk in order to get the logs.
After the corruption had taken place we reviewed a dump of the first few megabytes of the disk (wd0d), and a hexdump revealed that the structures destined for the g partition had been written over the initial sectors of the drive.
The timestamp of the structures matched the time of the error exactly, and the offset recorded in the structure was 20 sectors beyond the point of the error (as reported in the error message); the writes are normally large blocks of a few hundred kilobytes. The data was never written into the wd0g partition: a dump of it yielded 60kB of zero-filled data amongst the normal data.
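For anyone wanting to repeat this kind of check on a dump, here is a minimal Python sketch of the scan. The real structure layout is not given in this mail, so the magic value, the field widths, the byte order and the assumption that every structure starts on a sector boundary are all hypothetical placeholders.

```python
import struct

SECTOR_SIZE = 512
MAGIC = 0x53424B31               # placeholder, NOT the real magic number
HEADER = struct.Struct("<IIQ")   # assumed layout: magic, timestamp, offset in sectors


def find_records(dump):
    """Yield (sector, timestamp, offset) for each sector starting with MAGIC."""
    for pos in range(0, len(dump) - HEADER.size + 1, SECTOR_SIZE):
        magic, timestamp, offset = HEADER.unpack_from(dump, pos)
        if magic == MAGIC:
            yield pos // SECTOR_SIZE, timestamp, offset


def misplaced(records, part_offset):
    """Return records whose stored offset does not match where they were found.

    Assumes the stored offset is relative to the start of the partition.
    """
    return [r for r in records if r[0] != part_offset + r[2]]
```

Run over a dump of the start of wd0d with part_offset set to the start of the g partition, any structure found there would be reported as misplaced, since its stored offset points far into g rather than at sector 0.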
I have added some additional information to this e-mail:
1. atactl of drive
2. disklabel before the problem
3. disklabel after the problem (corrupted)
Does anyone have any idea how this could have happened? Ideas I've had are:
1. A problem with the rewrite attempt in NetBSD
2. A corruption on the PCI transfer
3. An error on the drive
   - an incorrect sector write
   - a failed reallocation
Any help would be much appreciated...
Thanks
Stuart
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Extra info:
An 'atactl identify' of the drive is as follows:
Model: WDC WD5000AAJS-22TKA0, Rev: 12.01C01, Serial #: WD-WCAPW3229413
Device type: ATA, fixed
Device supports command queue depth of 15
Device capabilities:
        DMA
        LBA
        ATA standby timer values
        IORDY operation
        IORDY disabling
Device supports following standards:
ATA-1 ATA-2 ATA-3 ATA-4 ATA-5 ATA-6 ATA-7
Command set support:
        NOP command (enabled)
        READ BUFFER command (enabled)
        WRITE BUFFER command (enabled)
        Host Protected Area feature set (enabled)
        look-ahead (enabled)
        write cache (enabled)
        Power Management feature set (enabled)
        Security Mode feature set (disabled)
        SMART feature set (enabled)
        FLUSH CACHE EXT command (enabled)
        FLUSH CACHE command (enabled)
        Device Configuration Overlay feature set (enabled)
        48-bit Address feature set (enabled)
        Automatic Acoustic Management feature set (disabled)
        SET MAX security extension (disabled)
        SET FEATURES required to spin-up after power-up (enabled)
        Power-Up In Standby feature set (disabled)
        DOWNLOAD MICROCODE command (enabled)
        World Wide name
        General Purpose Logging feature set
        SMART self-test
        SMART error logging
The disklabel of the drive beforehand:
# /dev/rwd1d:
type: ESDI
disk: Autowd0
label: 200708280820040f
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 969020
total sectors: 976772160
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
7 partitions:
#        size     offset  fstype [fsize bsize cpg/sgs]
 a:   3072000         63  4.2BSD   1024  8192 46552  # (Cyl.      0*-   3047*)
 b:   1228800    3072063    swap                     # (Cyl.   3047*-   4266*)
 c: 976773105         63  unused      0     0        # (Cyl.      0*- 969020+)
 e:  10240000    4300863  4.2BSD   1024  8192 46552  # (Cyl.   4266*-  14425*)
 f:  37525504   14540863  4.2BSD   1024  8192 55192  # (Cyl.  14425*-  51653*)
 g: 922779648   52066367  4.2BSD   1024  8192 55200  # (Cyl.  51653*- 967109*)
And the disklabel after the corruption was as follows:
type: ESDI
disk: WDC WD5000AAJS-2
label: fictitious
flags:
bytes/sector: 512
sectors/track: 63
tracks/cylinder: 16
sectors/cylinder: 1008
cylinders: 969021
total sectors: 976773168
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0 # microseconds
track-to-track seek: 0 # microseconds
drivedata: 0
4 partitions:
#        size     offset  fstype [fsize bsize cpg/sgs]
 a: 976773168          0  4.2BSD      0     0     0  # (Cyl.      0 - 969020)
 d: 976773168          0  unused      0     0        # (Cyl.      0 - 969020)