Subject: Re: Raid Problems (URGEND)
To: Uwe Lienig <uwe.lienig@fif.mw.htw-dresden.de>
From: Greg Oster <oster@cs.usask.ca>
List: port-alpha
Date: 08/29/2006 10:36:24
Uwe Lienig writes:
> Hello alpha fellows,
>
> The problem with this RAID came when 2 components of a RAID array (NetBSD
> RAIDframe - software RAID) failed within a short time.

Unfortunately, RAIDframe isn't built to deal with such problems...
and the version you're running (in 1.6.2) is much less forgiving
about 2-component errors than, say, what is in 3.0 or -current.

> I'm sending this to Greg too since he was helping me the last time I had
> trouble with RAID.
>
> OS-specific info:
> NetBSD 1.6.2
>
> hardware:
> DEC3000/300
> 2 x tcds SCSI adapters
>
> The system is off site and I don't have direct access, but I can phone the
> site to issue commands on the console.
>
[snip]
> :
> (then info for sd11, sd12, sd13, sd30, sd31, sd32, sd33, hard-wired in the
>  kernel config to these SCSI devices)
>
> There is a disk (sd0) for the OS, a 4 GByte Barracuda. The data is stored
> in a raid5. The raid consists of 6 identical IBM DDYS-T18350 disks
> (sd1[0-2], sd3[0-2]) plus a spare configured into the raid (sd13b) and a
> cold spare (sd33) (total 8 disks).
>
> Everything worked OK for two years. But during the weekend of 26/27 Aug
> 2006, two disks (sd30 and sd31) failed.
>
> Prior to failure the raid config was as follows.
> Original config (raid0)
>
> START array
> 1 6 1
>
> START disks
> /dev/sd10b
> /dev/sd11b
> /dev/sd12b
> /dev/sd30b
> /dev/sd31b
> /dev/sd32b
>
> START spare
> /dev/sd13b
>
> After the raid was initially created two years ago, autoconfiguration was
> switched on:
>
> raidctl -A yes /dev/raid0
>
> Step 1
> --------------------------------
>
> When sd30 (about 12 hours before sd31) and sd31 failed, the system went
> down. After that the raid couldn't be configured any more; raid0 was
> missing.

Right...

> Due to the failure of 2 disks the raid couldn't manage to come up again.
> raidctl -c failed (incorrect modification counter).
>
> First I tried to get the raid going by reconfiguring via
>
> raidctl -C /etc/raid0.conf raid0

Ow...  If sd30 was known to have failed 12 hours before sd31, then
sd30 should have been removed from the config file before attempting
the "-C".  At the point at which sd31 failed, the data on all the
disks (sans sd30) would have been self-consistent.  Adding sd30 back
into the mix (and claiming the data is "valid" with the use of "-C")
is just going to lead to all sorts of problems...

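For illustration only, and untested on 1.6.2: the idea would have been to
keep the component count at six but make sure sd30's slot could not be
read, e.g. by pointing it at a device name that doesn't actually exist
(newer RAIDframe also accepts the keyword "absent" there), so the set comes
up degraded on the other five disks:

   START disks
   /dev/sd10b
   /dev/sd11b
   /dev/sd12b
   /dev/sd99b      # any nonexistent device; this slot configures as failed
   /dev/sd31b
   /dev/sd32b

and then

   raidctl -C /etc/raid0.conf raid0

That keeps the component count at six, but the stale data on sd30 stays out
of the picture, and the set comes up degraded on the five disks whose
contents are still consistent with each other.
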
> After that the raid came up and a /dev/raid0 was accessible. I hoped that
> the read errors of sd31 would not persist, and tried to fail sd30.

The problem is that any data writes in those 12 hours would not be
reflected onto sd30, and with "-C" you've told it "sd30 is fine", and
the system will be attempting to read data from that component.

> raidctl -F /dev/sd30b raid0
>
> This caused a panic since sd31 produced hard errors again.

Right.. :(  (this panic doesn't happen in at least 3.0 and later...)

> _____________________________________________________________
>
> Step 2
> ------------------------------------
>
> To get the raid going again, I decided to copy sd31 to sd33 (the cold
> spare). This would allow the raid to come up since there will be no hard
> errors. To copy I used (all disks are identical):
>
> dd if=/dev/rsd31c bs=1b conv=noerror,sync of=/dev/rsd33c
>=20
> I know that there will be some blocks with wrong info in them (dd will
> produce blocks filled with null bytes on read errors). sd30 remains failed.
> But sd31 will not produce read errors anymore. Thus the building of the
> raid will succeed.

Well.. except you *will* lose data for any of the blocks that had the
wrong info... (i.e. zeros instead of other valid data).

> Then I edited /etc/raid0.conf and changed sd31 to sd33, so it looks like:
>
> START disks
> /dev/sd10b
> /dev/sd11b
> /dev/sd12b
> /dev/sd30b
> # changed sd31 to sd33
> /dev/sd33b
> /dev/sd32b
>
> I didn't change the spare line.
>
> After a reboot the raid came up correctly and was configured automagically.
> Since all the filesystems that were on the raid were commented out, the
> raid remained untouched after configuration.
>
> raidctl -s /dev/raid0
>
> showed
>
>             /dev/sd10b: optimal
>             /dev/sd11b: optimal
>             /dev/sd12b: optimal
>             /dev/sd30b: failed
>             /dev/sd31b: optimal
>             /dev/sd32b: optimal
>             spares: no spares
> and
>             Parity status: dirty
>             Reconstruction is 100% complete.
>             Parity Re-write is 100% complete.
>             Copyback is 100% complete.
>
> Two questions: why is sd31 not replaced by sd33?

Using "dd" to replace a dead disk is a bit tricky... You need to have
the RAID set unconfigured, then do the dd, and then configure (with -C)
with a raid0.conf file that reflects the new component change.  If
you miss this last step, then the autoconfig is going to just pick
the set of components that match, and, unfortunately, since both sd31
and sd33 have the same component label, it'll just take the first one
it finds.
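
Spelled out, the whole dance would look roughly like this (a sketch only,
using your device names; the serial number below is just an example, and I
haven't re-checked each step against 1.6.2):

   raidctl -u raid0                   # unconfigure the set first
   dd if=/dev/rsd31c bs=1b conv=noerror,sync of=/dev/rsd33c
   # edit /etc/raid0.conf: replace /dev/sd31b with /dev/sd33b
   raidctl -C /etc/raid0.conf raid0   # force a config with the new component
   raidctl -I 2006082901 raid0        # stamp fresh component labels
   raidctl -iv raid0                  # re-write the parity

and then getting sd31 physically out of the box (or at least clobbering its
component label) before the next reboot, so that autoconfig can't find two
disks claiming to be the same component.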

> Why is there no spare?

Spares never get auto-configured.
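
(Which means that after every reboot of an autoconfigured set the spare has
to be re-added by hand, or from an rc script, e.g.

   raidctl -a /dev/sd13b raid0

before anything can be reconstructed onto it.)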

> Where has sd13 gone? raidctl -F /dev/sd30b raid0 didn't succeed due to the
> immediate panic in step 1.
> _____________________________________________________________
>
> Step 3
> -------------------------------------------------------------
>
> I was sure that sd13 wasn't used, so I added sd13 again:
>
> raidctl -a /dev/sd13b /dev/raid0
>
> Then I initiated reconstruction again:
> raidctl -F /dev/sd13b /dev/raid0
                  ^^^^^
You failed the hot-spare?  Or is this a typo?

> The system panicked again.

I'm not sure how well RAIDframe in 1.6.2 would take to failing a
hot-spare... I could see it might panic... :-/
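
For the record, -F normally takes the failed data component rather than the
spare; the intended sequence would presumably have been something like

   raidctl -a /dev/sd13b raid0     # register sd13b as a hot spare
   raidctl -F /dev/sd30b raid0     # fail sd30b and reconstruct it onto the spare

(untested on 1.6.2 from here, and given the state of this set I wouldn't
run it now anyway).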

> _____________________________________________________________
>
> Step 4
> -------------------------------------------------------------
> After reboot the system configured the raid. Now I have:
>
> raidctl -s /dev/raid0
>=20
> /dev/sd10b: optimal
> /dev/sd11b: optimal
> /dev/sd12b: optimal
> /dev/sd13b: optimal
> /dev/sd31b: failed
> /dev/sd32b: optimal
> spares: no spares
>
> Where is sd33, and why has sd31 failed? sd31 was replaced by sd33 in the
> config file and should be optimal.

Except.. the system is still using auto-config???  Autoconfig works
really nicely, but this sounds like a place where manual config would
have worked better... :(  (When one starts mucking with disks outside
of RAIDframe's control, one really needs to make sure the
component labels are set correctly... and for such occasions,
manual config is probably better...)
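
If it helps, you can read back what autoconfig is keying on, and turn
autoconfig off for the set, once it is configured (options quoted from
memory, so double-check raidctl(8) on your 1.6.2 system):

   raidctl -g /dev/sd33b raid0     # print the component label on sd33b
   raidctl -A no raid0             # disable autoconfiguration for this set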

> Now I tried to check the file systems on the raid, although the raid is
> not completely functional.
>
> fsck -n -f /dev/raid0{a,b,d,e,f}
>
> Some file systems have more errors, some fewer. Basically it seems normal
> from the file system point of view. But I don't know what state the raid
> is in.

I wouldn't trust any data from the RAID set that you can't verify by
some other means (e.g. checksums or whatever).  In short, I'd consider
the data on the RAID set to be corrupt.

> I'm stuck at this point as to what to do now. I'd really like to get the
> raid going again. I've ordered new drives already. But I'd like to bring
> the raid back into a state that will allow for correct operation again
> without reconstructing everything from scratch. Yes, I have a backup on
> tape (although not the newest one, since the last backup on Friday 25th
> prior to this crash didn't make it). So the backup is from two weeks ago.

That's about where you'll need to start then... the problem is that by
forcing a config with -C with both sd30 and sd31 back in step 1, you
ended up corrupting the stripes that were written to in that 12-hour
period.  And there is no way to "undo" that damage....  Had you taken
sd30 out and left sd31 in, at least the data+parity on the "remaining"
drives would be self-consistent, and your loss would be limited to the
data on stripes where blocks on sd31 are unreadable.

> I see this as a test case for dealing with those errors on raid sets.

Unfortunately, this is also a test case that proves that RAID 5 can
only deal with a single disk failure :-}  Once you get into two-disk
failure land, things get really ugly, and it doesn't matter much if
you're using software RAID or hardware RAID.  NetBSD 3.0 and later
are more tolerant of the 2nd disk failing, and won't panic the system.
(You'll just get an I/O error.)  But even there, recovering from a
2-disk failure may not be possible without backups either...

> Since this is a file server, I have to make sure that the system gets up
> again as quickly as possible.

If you have to get going "today", I'd build a brand-new RAID set out of
sd10b, sd11b, sd12b, sd13b, sd32b, and sd33b (i.e. take sd30 and sd31 out
entirely) and restore from backups.
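
A rough outline, reusing the layout and queue sections from your existing
raid0.conf (so a sketch, not a tested recipe, and the serial number is just
an example):

   START array
   # numRow numCol numSpare
   1 6 0

   START disks
   /dev/sd10b
   /dev/sd11b
   /dev/sd12b
   /dev/sd13b
   /dev/sd32b
   /dev/sd33b

   # keep the existing START layout and START queue sections unchanged

then

   raidctl -C /etc/raid0.conf raid0   # initial (forced) configuration
   raidctl -I 2006082902 raid0        # new serial for the component labels
   raidctl -iv raid0                  # initialize the parity
   raidctl -A yes raid0               # re-enable autoconfig if you want it

and then disklabel raid0, newfs the partitions, and restore from the tapes.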

> Thank you all for your input.

Good luck!

Later...

Greg Oster