Subject: Re: some problems with "old" RAIDframe arrays on netbsd-1-6
To: Greg Oster <oster@cs.usask.ca>
From: Greg A. Woods <woods@weird.com>
List: tech-kern
Date: 10/19/2003 19:13:50
[ On Sunday, October 19, 2003 at 15:13:27 (-0600), Greg Oster wrote: ]
> Subject: Re: some problems with "old" RAIDframe arrays on netbsd-1-6 
>
> Can you send the output of "raidctl -s raid0" and of "disklabel foo0" 
> where "foo0" contains one of the active components?  RAIDframe is 
> usually only crabby about these sorts of things if there is an actual 
> size difference that will cause a problem.

Note that if I'm reading the kernel message right it was apparently
seeing the spare partition as only 512 sectors:

 	Spare disk /dev/sd6a (512 blocks) is too small to serve as a spare (need 8890688 blocks)

The underlying problem I'm having which requires a new component on this
array is that one of the disks, and/or SCSI isolators in the hot swap
bays, is going bad.  I either get a bus parity error, or something like:

	sd12(ahc1:0:8:0): Unexpected busfree in Data-in phase
	SEQADDR == 0x113
	sd12(ahc1:0:8:0): generic HBA error

Of course after a reboot all the disks on that shelf look fine until the
drive gets used a bit.  Indeed it will usually work long enough to do a
full reconstruct.

This time when I rebooted, sd12d initially looked "optimal", but it still
had "autoconfig" disabled in its component label, so I had to manually
configure raid0 to get it working.

So while still in single user mode I failed /dev/sd12d and reconstructed
right back to it (-R), then did a forced fsck just to make sure
everything was OK, and it was.
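
For anyone following along, the in-place rebuild described above amounts
to roughly the following.  This is only a sketch: the "run" and
"rebuild_in_place" helpers and the DRY_RUN variable are made up for
illustration, and by default it just prints the commands rather than
touching the array.  "raidctl -R" fails the component and reconstructs
back onto it; "raidctl -S" reports reconstruction progress.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
	if [ "${DRY_RUN:-1}" = 1 ]; then
		echo "$*"
	else
		"$@"
	fi
}

# Fail a component and reconstruct straight back onto it (-R), then
# check the progress of the reconstruction (-S).
rebuild_in_place() {
	raid=$1
	comp=$2
	run raidctl -R "$comp" "$raid" &&
	run raidctl -S "$raid"
}

# Dry run for the array and component discussed in this message.
rebuild_in_place raid0 /dev/sd12d
```

Setting DRY_RUN=0 would actually execute the commands, which is only
sensible in single user mode with the array otherwise idle.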

Note too that when I did the manual re-configure, the new "sd6d" appeared
as a spare, as expected, because I now had it listed in the raid0.conf
file.  However, perhaps because it wasn't formally added, and because the
array was already initialized, it disappeared again on the next reboot.

Anyway here's how it looked after I re-constructed sd12d in single user:

# raidctl -v -s raid0
Components:
           /dev/sd7d: optimal
           /dev/sd8d: optimal
           /dev/sd9d: optimal
          /dev/sd10d: optimal
          /dev/sd11d: optimal
          /dev/sd12d: optimal
Spares:
           /dev/sd6d: spare
Component label for /dev/sd7d:
   Row: 0, Column: 0, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
Component label for /dev/sd8d:
   Row: 0, Column: 1, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
Component label for /dev/sd9d:
   Row: 0, Column: 2, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
Component label for /dev/sd10d:
   Row: 0, Column: 3, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
Component label for /dev/sd11d:
   Row: 0, Column: 4, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
Component label for /dev/sd12d:
   Row: 0, Column: 5, Num Rows: 1, Num Columns: 6
   Version: 2, Serial Number: 2, Mod Counter: 533
   Clean: No, Status: 0
   sectPerSU: 32, SUsPerPU: 1, SUsPerRU: 1
   Queue size: 100, blocksize: 512, numBlocks: 8890688
   RAID Level: 5
   Autoconfig: Yes
   Root partition: No
   Last configured as: raid0
/dev/sd6d status is: spare.  Skipping label.
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.


It didn't take long before sd12 failed again with the SCSI error shown
above appearing on the console, followed of course by:

	raid0: IO Error.  Marking /dev/sd12d as failed.
	raid0: node (Rod) returned fail, rolling backward
	sd12(ahc1:0:8:0): generic HBA error
	raid0: Disk /dev/sd12d is already marked as dead!
	raid0: node (Rod) returned fail, rolling backward
	raid0: DAG failure: w addr 0x13937df (20527071) nblk 0x20 (32) buf 0xc5a4e000
	raid0: DAG failure: w addr 0x13874df (20477151) nblk 0x20 (32) buf 0xc5a42000

Now, as I mentioned, after the second reboot sd6d disappeared as a spare
again as well:

	# raidctl -v -s raid0
	Components:
	           /dev/sd7d: optimal
	           /dev/sd8d: optimal
	           /dev/sd9d: optimal
	          /dev/sd10d: optimal
	          /dev/sd11d: optimal
	          /dev/sd12d: failed
	No spares.
	Component label for /dev/sd7d:
	[[ ... snip ... ]]
	/dev/sd12d status is: failed.  Skipping label.
	Parity status: clean
	Reconstruction is 100% complete.
	Parity Re-write is 100% complete.
	Copyback is 100% complete.



Here's the disklabel from the first of the original components:

	# disklabel sd7
	# /dev/rsd7d:
	type: SCSI
	disk: QUANTUM_X34550WD
	label: 
	flags:
	bytes/sector: 512
	sectors/track: 150
	tracks/cylinder: 10
	sectors/cylinder: 1500
	cylinders: 5899
	total sectors: 8890760
	rpm: 7200
	interleave: 1
	trackskew: 0
	cylinderskew: 0
	headswitch: 0           # microseconds
	track-to-track seek: 0  # microseconds
	drivedata: 0 
	
	4 partitions:
	#        size    offset     fstype  [fsize bsize cpg/sgs]
	 d:   8890760         0       RAID                      # (Cyl.    0 - 5927*)

(Note that before I upgraded, the fstype was "unused", though as I said it
still auto-configured; after I upgraded I had to change it to "RAID", of
course, to get autoconfig to work on these "old" arrays.)
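
As a cross-check on the sizes, the numBlocks value in the component
labels (8890688) is consistent with the "d" partition size (8890760)
less a reserved area at the front of the component, rounded down to a
whole number of stripe units.  The 64-sector reserve used here is an
assumption about RAIDframe's protected region, not something shown in
the output above; sectPerSU (32) comes from the component label.  The
function name is made up for illustration.

```shell
#!/bin/sh
# usable_blocks SIZE SECT_PER_SU [RESERVED]
# SIZE and SECT_PER_SU are in 512-byte sectors.
# RESERVED defaults to 64 sectors, which is an ASSUMPTION.
usable_blocks() {
	size=$1
	su=$2
	reserved=${3:-64}
	# Integer division rounds down to a whole number of stripe units.
	echo $(( ($size - $reserved) / $su * $su ))
}

usable_blocks 8890760 32
```

With those inputs this prints 8890688, which matches the numBlocks
reported for every component.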

Here's the disklabel from sd6, which I'm about to try re-adding as a
spare again:

	# disklabel sd6 
	# /dev/rsd6d:
	type: SCSI
	disk: VIKING 4.5 WSE
	label: raid0-spare
	flags:
	bytes/sector: 512
	sectors/track: 181
	tracks/cylinder: 8
	sectors/cylinder: 1448
	cylinders: 6144
	total sectors: 8896512
	rpm: 7200
	interleave: 1
	trackskew: 0
	cylinderskew: 0
	headswitch: 0           # microseconds
	track-to-track seek: 0  # microseconds
	drivedata: 0 
	
	4 partitions:
	#        size    offset     fstype  [fsize bsize cpg/sgs]
	 d:   8896512         0       RAID                      # (Cyl.    0 - 6143)

(Note this time I've just left it at its full-disk size...)

This time of course it works for some unexplained reason:

	# raidctl -v -a /dev/sd6d raid0
	#

and from the console:

	Warning: truncating spare disk /dev/sd6d to 8890688 blocks

So, whatever the problem was it's not easily reproducible.
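
The two console messages quoted in this message ("too small" and
"truncating") are what one would expect from a simple comparison of the
spare's usable size against the per-component size the array needs.  A
rough sketch of that comparison follows; the function name is made up,
and the 64-sector reserve is an assumption rather than anything taken
from the kernel output.

```shell
#!/bin/sh
# classify_spare SPARE_SIZE NEEDED [RESERVED]
# Sizes are in 512-byte sectors.  RESERVED defaults to 64 (ASSUMPTION).
classify_spare() {
	usable=$(( $1 - ${3:-64} ))
	if [ "$usable" -lt "$2" ]; then
		echo "too small"
	elif [ "$usable" -gt "$2" ]; then
		echo "truncated by $(( $usable - $2 )) blocks"
	else
		echo "exact fit"
	fi
}

classify_spare 8896512 8890688	# full-disk sd6d from the disklabel
classify_spare 512 8890688	# the size reported in the earlier failure
```

This prints "truncated by 5760 blocks" for the full-disk partition and
"too small" for the 512-block case, so the label shown above should
always have been big enough; it does not explain why the kernel saw
only 512 blocks the first time.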

It's now beginning the reconstruction.....

(and perhaps because sd6 is on a separate bus from sd7-sd12 it is now
claiming only 21 minutes -- about half the time it took to reconstruct
to sd12 before!)
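
For a sense of scale, the 21 minute estimate can be turned into an
approximate per-component rate.  This is back-of-the-envelope integer
arithmetic only, assuming the whole 8890688-block component is written
and that the estimate holds; the function name is made up.

```shell
#!/bin/sh
# rebuild_rate_kib BLOCKS MINUTES
# BLOCKS are 512-byte sectors, so BLOCKS / 2 is the size in KiB.
# Prints the average rate in KiB/s, rounded down.
rebuild_rate_kib() {
	echo $(( $1 / 2 / ($2 * 60) ))
}

rebuild_rate_kib 8890688 21
```

That works out to 3528 KiB/s, or a little under 3.5 MiB/s written to the
new component.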

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>