Subject: Re: Bad sectors vs RAIDframe
To: None <tls@rek.tjls.com>
From: Charles Swiger <cswiger@mac.com>
List: netbsd-users
Date: 06/06/2005 16:36:25
On Jun 6, 2005, at 1:06 PM, Thor Lancelot Simon wrote:
> On Mon, Jun 06, 2005 at 11:44:51AM -0500, J Chapman Flack wrote:
>> Thor Lancelot Simon wrote:
>>> Most IDE drives only spare out sectors on *write* (one must ask: what,
>>> exactly, could they do to avoid presenting a read error on read -- and
>>
>> If the question wasn't rhetorical, I think the answer is, it's a "read
>> error" if the drive had to apply ECC to recover the correct data; then
>> it reassigns the block, writes the recovered data to the new block, and
>> returns the recovered data to the host.

Agreed, the drive should be able to notice a sector failing when it  
reads the data.

> Right, so, there are two problems here.
>
> First, even if some errors are correctable with ECC, some aren't.  Is it
> correct for the drive to automatically spare out on an _uncorrectable_
> error?

Maybe.  While you can't be sure that you've gotten correct data back  
when ECC isn't enough to correct the problem, you might still get  
lucky, or get data that is partially valid and partially corrupt,  
which is better than nothing for most people (i.e., those who are not  
using RAID).

If you migrate the problem sector to a spare regardless, at least you  
can write to that sector again and have the write succeed.  If you  
don't migrate a potentially failing sector, what happens if you write  
to it and the drive fails to notice, while you've still got the actual  
data handy, that what was written can't be read back?

The question is what should be done to the newly allocated spare  
sector if the original sector can't be fully corrected.
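
To make the two options concrete, here is a toy sketch in Python.  It  
is purely illustrative: ToyDrive, zero_fill_spares and the "bad" set  
are names I made up, and no real firmware is claimed to work this way.  
One policy hands out a blank spare on an uncorrectable read; the other  
keeps reporting the error until the host rewrites the sector.

```python
class ToyDrive:
    """Toy model of sector sparing (hypothetical, not real firmware)."""

    def __init__(self, sectors, zero_fill_spares):
        self.data = dict(enumerate(sectors))  # LBA -> contents
        self.bad = set()                      # LBAs that cannot be read
        self.zero_fill_spares = zero_fill_spares

    def read(self, lba):
        if lba not in self.bad:
            return self.data[lba]
        if self.zero_fill_spares:
            # Remap to a blank spare right away: the *next* read of this
            # LBA will "succeed" and return zeroes.
            self.bad.discard(lba)
            self.data[lba] = bytes(len(self.data[lba]))
        # Either way, this read itself fails.
        raise IOError(f"uncorrectable read error at LBA {lba}")

    def write(self, lba, block):
        # A write lands on good media (a spare if needed), so the
        # sector becomes readable again with known contents.
        self.bad.discard(lba)
        self.data[lba] = block
```

With zero_fill_spares=False the sector stays an error until someone  
(the host, or a RAID layer with parity) writes good data to it.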

> If it does so, and the host retries the read, it will get back
> a block full of zeroes -- which will cause a particularly ugly kind of
> data corruption in a parity RAID setup.

Yes, I can see that.  If the drive allocates a blank spare sector  
upon a read error, and a retry then gives that back without  
indicating an error, that's going to cause big problems.
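
A small Python sketch of why, assuming simple XOR parity across three  
data disks (the 8-byte "sectors" and all names here are made up for  
illustration, and this is not RAIDframe's code):

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID 4/5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# One stripe: three data blocks plus their parity block.
d0 = b"AAAAAAAA"
d1 = b"BBBBBBBB"
d2 = b"CCCCCCCC"
parity = xor_blocks(d0, d1, d2)

# Case 1: disk 1 reports a read error.  The RAID layer knows d1 is
# missing and can rebuild it from the surviving blocks and parity.
rebuilt_d1 = xor_blocks(d0, d2, parity)

# Case 2: disk 1 silently returns a zero-filled spare instead.
silent_d1 = bytes(len(d1))

# The stripe no longer matches its parity, but nothing says which
# block is the wrong one.
stripe_consistent = xor_blocks(d0, silent_d1, d2) == parity

# If disk 2 then fails, its block is rebuilt from the bogus d1 and
# comes back wrong, so the damage spreads to a second block.
rebuilt_d2 = xor_blocks(d0, silent_d1, parity)
```

Here rebuilt_d1 equals d1, but rebuilt_d2 comes out as d1 XOR d2  
rather than d2.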

-- 
-Chuck