Subject: RAID Problems (URGENT)
To: NetBSD alpha <port-alpha@netbsd.org>
From: Uwe Lienig <uwe.lienig@fif.mw.htw-dresden.de>
List: port-alpha
Date: 08/29/2006 14:59:27
Hello alpha fellows,

The problem with this RAID arose when two components of the array (NetBSD 
RAIDframe software RAID) failed within a short time.

I am sending this to Greg too, since he helped me the last time I had trouble 
with RAID.

OS-specific info:
NetBSD 1.6.2

Hardware:
DEC 3000/300
2 x tcds (TurboChannel Dual SCSI) adapters

The system is off site and I don't have direct access, but I can phone the site 
and have commands issued on the console.

Excerpt from the boot messages:
/netbsd: DEC 3000 - M300, 150MHz, s/n
/netbsd: 8192 byte page size, 1 processor.
/netbsd: total memory = 256 MB
/netbsd: tcds0 at tc0 slot 4 offset 0x0: TurboChannel Dual SCSI (baseboard)
/netbsd: asc0 at tcds0 chip 0: NCR53C94, 25MHz, SCSI ID 7
/netbsd: scsibus0 at asc0: 8 targets, 8 luns per target
/netbsd: tcds2 at tc0 slot 1 offset 0x0: TurboChannel Dual SCSI
/netbsd: tcds2: fast mode set for chip 0
/netbsd: asc3 at tcds2 chip 0: NCR53C96, 40MHz, SCSI ID 7
/netbsd: scsibus3 at asc3: 8 targets, 8 luns per target
/netbsd: tcds2: fast mode set for chip 1
/netbsd: asc4 at tcds2 chip 1: NCR53C96, 40MHz, SCSI ID 7
/netbsd: scsibus4 at asc4: 8 targets, 8 luns per target
/netbsd: tcds1 at tc0 slot 0 offset 0x0: TurboChannel Dual SCSI
/netbsd: tcds1: fast mode set for chip 0
/netbsd: asc1 at tcds1 chip 0: NCR53C96, 40MHz, SCSI ID 7
/netbsd: scsibus1 at asc1: 8 targets, 8 luns per target
/netbsd: tcds1: fast mode set for chip 1
/netbsd: asc2 at tcds1 chip 1: NCR53C96, 40MHz, SCSI ID 7
/netbsd: scsibus2 at asc2: 8 targets, 8 luns per target
/netbsd: scsibus0: waiting 2 seconds for devices to settle...
/netbsd: sd10 at scsibus3 target 0 lun 0: <IBM, DDYS-T18350N, S96H> SCSI3 0/direct fixed
/netbsd: sd10: 17501 MB, 15110 cyl, 6 head, 395 sec, 512 bytes/sect x 35843670 sectors
/netbsd: sd10: sync (100.0ns offset 15), 8-bit (10.000MB/s) transfers, tagged queueing
:
(then the same info for sd11, sd12, sd13, sd30, sd31, sd32, sd33, hard-wired in
   the kernel config to these SCSI devices)

There is a separate disk (sd0) for the OS, a 4 GB Barracuda. The data is stored 
on a RAID 5 set. The set consists of six identical IBM DDYS-T18350 drives 
(sd1[0-2], sd3[0-2]), plus a hot spare configured into the set (sd13b) and a 
cold spare (sd33), eight disks in total.

Everything had worked fine for two years. But during the weekend of 26/27 Aug 
2006 two disks (sd30 and sd31) failed.

Prior to the failure the RAID configuration was as follows.
Original config (raid0):

START array
1 6 1

START disks
/dev/sd10b
/dev/sd11b
/dev/sd12b
/dev/sd30b
/dev/sd31b
/dev/sd32b

START spare
/dev/sd13b

After the RAID was initially created two years ago, autoconfiguration was switched on:

raidctl -A yes /dev/raid0
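
(For what it's worth, my understanding is that the autoconfig flag ends up in
each component label. Assuming -g on 1.6.2 behaves as described in raidctl(8),
something like this should show it:)

# print one member's component label; it records the serial number, the
# modification counter, the autoconfig flag and "last configured as: raid0"
raidctl -g /dev/sd10b raid0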

Step 1
--------------------------------

When sd30 failed (about 12 hours before sd31) and then sd31 failed, the system 
went down. After that the RAID could not be configured any more; raid0 was missing.

Because two disks had failed, the RAID could not come up again; raidctl -c 
failed (incorrect modification counter).

First I tried to get the RAID going again by forcing the configuration via

raidctl -C /etc/raid0.conf raid0

After that the RAID came up and /dev/raid0 was accessible. I hoped that the 
read errors on sd31 would not persist and tried to fail sd30:

raidctl -F /dev/sd30b raid0

This caused a panic since sd31 produced hard errors again.
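
(In hindsight, a raw read test would probably have told me whether sd31 really
still had hard errors before I failed sd30; something like the following,
watching the console for sd31 error messages, without touching the RAID:)

# read the whole disk once; hard read errors show up on the console
dd if=/dev/rsd31c of=/dev/null bs=64k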
_____________________________________________________________

Step 2
------------------------------------

To get the RAID going again, I decided to copy sd31 to sd33 (the cold spare). 
That would allow the RAID to come up, since there would be no more hard read 
errors. To copy I used (all disks are identical):

dd if=/dev/rsd31c bs=1b conv=noerror,sync of=/dev/rsd33c

I know that some blocks will contain wrong data (dd writes blocks filled with 
null bytes where it hits read errors). sd30 remains failed, but the copy on 
sd33 will not produce read errors any more, so bringing the RAID up should succeed.
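
(Just for comparison: a larger block size would have been much faster, but with
conv=noerror,sync every unreadable sector then blanks a whole block instead of
a single 512-byte sector, losing more data around each bad spot. I did not use
this variant:)

# faster, but every read error zeroes 64k instead of one sector
dd if=/dev/rsd31c of=/dev/rsd33c bs=64k conv=noerror,sync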

Then I edited /etc/raid0.conf and changed sd31 to sd33, so that it looked like this:

START disks
/dev/sd10b
/dev/sd11b
/dev/sd12b
/dev/sd30b
# changed sd31 to sd33
/dev/sd33b
/dev/sd32b

I didn't change the spare line.

After a reboot the RAID came up correctly and was configured automatically. 
Since all the file systems that live on the RAID were commented out, the RAID 
remained untouched after configuration.

raidctl -s /dev/raid0

showed

            /dev/sd10b: optimal
            /dev/sd11b: optimal
            /dev/sd12b: optimal
            /dev/sd30b: failed
            /dev/sd31b: optimal
            /dev/sd32b: optimal
            spares: no spares
and
            Parity status: dirty
            Reconstruction is 100% complete.
            Parity Re-write is 100% complete.
            Copyback is 100% complete.


Several questions: Why was sd31 not replaced by sd33? Why is there no spare? 
Where has sd13 gone? raidctl -F /dev/sd30b raid0 did not succeed because of the 
immediate panic in step 1.
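
(To see which label the kernel actually matched, I suppose I could dump the
component label of the member it calls sd31b, and, if this raidctl version has
it, the configuration the kernel is using; a sketch only:)

# serial number, mod counter and "last configured as" should tell whether
# this member is really sd31 or the dd copy on sd33
raidctl -g /dev/sd31b raid0

# if supported, print the configuration the kernel is currently using
raidctl -G raid0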
_____________________________________________________________

Step 3
-------------------------------------------------------------

I was sure that sd13 wasn't used, so I added sd13 as a spare again:

raidctl -a /dev/sd13b /dev/raid0

Then I initiated reconstruction again:
raidctl -F /dev/sd13b /dev/raid0

The system panicked again.
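
(For reference, the hot-spare sequence as I understand it from raidctl(8) is
roughly the following, with my device names; -F is aimed at the failed member
and reconstruction then goes onto whatever spare is registered:)

raidctl -a /dev/sd13b raid0    # register sd13b as a hot spare
raidctl -s raid0               # check that the spare is listed
raidctl -F /dev/sd30b raid0    # fail the bad member, rebuild onto the spare
raidctl -S raid0               # watch reconstruction progress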
_____________________________________________________________

Step 4
-------------------------------------------------------------
After the reboot the system configured the RAID again. Now I have:

raidctl -s /dev/raid0

/dev/sd10b: optimal
/dev/sd11b: optimal
/dev/sd12b: optimal
/dev/sd13b: optimal
/dev/sd31b: failed
/dev/sd32b: optimal
spares: no spares

Where is sd33, and why has sd31 failed? sd31 was replaced by sd33 in the config 
file and should be optimal.

Next I tried to check the file systems on the RAID, although the RAID is not 
fully functional:

fsck -n -f /dev/raid0{a,b,d,e,f}

Some file systems have more errors, some fewer. Basically things look normal 
from the file system point of view. But I don't know what state the RAID is in.
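
(Once all members show optimal again, my understanding is that the parity state
can be queried and, if necessary, rewritten like this; I have not run these yet:)

raidctl -p raid0    # report whether the parity is clean or dirty
raidctl -P raid0    # check the parity and re-write it if it is dirty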

I'm stuck at this point as to what to do next. I really would like to get the 
RAID going again. I've already ordered new drives, but I'd like to bring the 
RAID back into a state that allows correct operation again without rebuilding 
everything from scratch. Yes, I have a backup on tape (although not the newest 
one, since the last backup on Friday the 25th, just before this crash, didn't 
make it), so the backup is two weeks old.

I see this as a test case for dealing with this kind of failure on RAID sets.

Since this is a file server, I have to get the system up again as quickly as 
possible.

Thank you all for your input.

-- 


Uwe Lienig
----------
phone: (+49 351) 462 2780
fax: (+49 351) 462 3476
mailto:uwe.lienig@fif.mw.htw-dresden.de

Forschungsinstitut Fahrzeugtechnik
<http://www.fif.mw.htw-dresden.de>
parcels: Gutzkowstr. 22, 01069 Dresden
letters: PF 12 07 01,    01008 Dresden

Hochschule für Technik und Wirtschaft Dresden (FH)
Friedrich-List-Platz 1, 01069 Dresden