netbsd-bugs: kern/32018: raidframe reconstruction will panic when new component fails

Subject: kern/32018: raidframe reconstruction will panic when new component fails
To: None <kern-bug-people@netbsd.org, gnats-admin@netbsd.org,>
From: Wolfgang Stukenbrock <Wolfgang.Stukenbrock@nagler-company.com>
List: netbsd-bugs
Date: 11/08/2005 14:04:00

>Number:         32018
>Category:       kern
>Synopsis:       raidframe reconstruction will panic when new component fails
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Tue Nov 08 14:04:00 +0000 2005
>Originator:     Wolfgang Stukenbrock
>Release:        NetBSD 2.0.2
>Organization:
Dr. Nagler & Company GmbH
	
>Environment:
	
	
System: NetBSD s011 2.0.2 NetBSD 2.0.2 (NSW-Webproxy) #10: Mon Jun 13 14:14:26 CEST 2005 wgstuken@s012:/export/netbsd-2.0.2/usr/src/sys/arch/i386/compile/NSW-Webproxy i386
Architecture: i386
Machine: i386
>Description:
	A panic occurs during reconstruction (e.g. of a mirror) if some blocks
	cannot be written to the new device.
	The message "raid0: Recon write failed" is printed, followed by a panic at line
	880 of rf_reconstruct.c.

	This is very bad behaviour. If such an error occurs, the new component should be
	marked as failed and the reconstruction as a whole should fail.
	There is no need to bring down a running server because a reconstruction failed. The
	previous state of the raid device (degraded mode) is still intact.

	The problem is located in dev/raidframe/rf_reconstruct.c.
	At line 872 is the case label RF_REVENT_WRITE_FAILED in the event-processing
	code, and it falls through into the panic at line 880.
	The event RF_REVENT_WRITE_FAILED is raised at line 1290 in ReconWriteDoneProc() in
	the same file. This is the only place where this event
	is triggered.
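
	The fall-through described above can be illustrated with a small
	stand-alone model. This is only a sketch, not the code from
	rf_reconstruct.c: the MODEL_* names and the function
	model_process_recon_event() are hypothetical, only the event name
	RF_REVENT_WRITE_FAILED and the printed message come from this report, and
	the panic is modelled as a return value so that the example can be run
	safely in userland.

```c
#include <stdio.h>

/* Hypothetical event types; only the WRITE_FAILED case mirrors the report. */
enum model_event {
    MODEL_REVENT_READ_DONE,
    MODEL_REVENT_WRITE_DONE,
    MODEL_REVENT_WRITE_FAILED   /* stands in for RF_REVENT_WRITE_FAILED */
};

/* What the event loop ends up doing; a real kernel would panic instead. */
enum model_outcome {
    MODEL_CONTINUE,
    MODEL_PANIC
};

/*
 * Model of the reported control flow: the write-failed case prints a
 * message and then falls through into the catch-all panic branch.
 */
enum model_outcome
model_process_recon_event(enum model_event ev)
{
    switch (ev) {
    case MODEL_REVENT_READ_DONE:
    case MODEL_REVENT_WRITE_DONE:
        return MODEL_CONTINUE;

    case MODEL_REVENT_WRITE_FAILED:
        printf("raid0: Recon write failed\n");
        /* FALLTHROUGH */
    default:
        /* In the kernel this is where the panic would happen. */
        return MODEL_PANIC;
    }
}
```

	In this model a failed write ends in the same branch as an unknown
	event, which is the behaviour this report complains about.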
>How-To-Repeat:
	This is a bit complicated, because you need a disk that fails to write
	some blocks. If you have such a disk, set up a raid device (e.g. a mirror), fail
	one component, and start reconstruction onto the disk with the write problem.
	When the blocks that cannot be written are reached, the system will panic.
>Fix:
	Add code to the event processing part (around line 872 in rf_reconstruct.c) that
	will abort the reconstruction and set the new component to failed.
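
	A minimal sketch of the proposed behaviour, again as a stand-alone model
	and not as a patch for RAIDframe: all names here (struct model_raid,
	model_handle_write_failed() and the MODEL_COMP_* states) are hypothetical
	and assume a two-component mirror. The real fix would have to use the
	RAIDframe data structures and locking, which are not shown in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical component states for a two-disk mirror. */
enum model_comp_status {
    MODEL_COMP_OPTIMAL,
    MODEL_COMP_RECONSTRUCTING,
    MODEL_COMP_FAILED
};

/* Hypothetical, much simplified view of a raid set. */
struct model_raid {
    enum model_comp_status comp[2];
    int  recon_target;    /* index of the component being rebuilt */
    bool recon_running;   /* true while reconstruction is active */
};

/*
 * Handle a failed reconstruction write without panicking:
 * mark the new component as failed and stop the reconstruction.
 * The surviving component is left untouched, so the set stays
 * in degraded mode. Returns -1 to signal that reconstruction
 * was aborted.
 */
int
model_handle_write_failed(struct model_raid *r)
{
    assert(r != NULL);
    assert(r->recon_target >= 0 && r->recon_target < 2);

    printf("raid0: Recon write failed, aborting reconstruction\n");
    r->comp[r->recon_target] = MODEL_COMP_FAILED;
    r->recon_running = false;
    return -1;
}
```

	The point of the design is that the error is reported to the caller
	instead of stopping the machine; the set is in the same degraded state
	as before the reconstruction was started.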

	PS. Perhaps something equivalent should be added for read errors. In that case,
	the reconstruction has failed and at least one more component of the raid device
	is gone (-> status = failed). I don't know whether such read errors are already
	handled somewhere else.

	I don't have the time to fully understand the whole raidframe code, so I cannot
	provide code that fixes this problem. Sorry.

>Unformatted: