tech-kern archive


Re: Problems with raidframe under NetBSD-5.1/i386



On Thu, 20 Jan 2011 17:28:21 -0800
buhrow%lothlorien.nfbcal.org@localhost (Brian Buhrow) wrote:

>       hello.  I got sidetracked from this problem for a while, but
> I'm back to looking at it as I have time.
>       I think I may have been barking up the wrong tree with
> respect to the problem I'm having reconstructing to raidframe disks
> with wedges on the raid sets.  Putting a little extra info into the
> error messages yields:
> 
> raid2: initiating in-place reconstruction on column 4
> raid2: Recon write failed (status 30 (0x1e))!
> raid2: reconstruction failed.
> 
>       If that status number, taken from the second argument of
> rf_ReconWriteDoneProc(), is an error from /usr/include/sys/errno.h,
> then I'm getting EROFS when I try to reconstruct the disk.

Hmmm... strange...

> Wouldn't
> that seem to imply that raidframe is trying to write over some
> protected portion of one of the components, probably the one I can't
> reconstruct to? Each of the components has a BSD disklabel on it, and
> I know that the raid set actually begins 64 sectors from the start of
> the partition in which the raid set resides.  However, is a similar
> "back set" done for the end of the raid?  That is, does the raid set
> extend all the way to the end of its partition or does it leave some
> space at the end for data as well? 

No, it doesn't.  The RAID set will use the remainder of the component,
but only up to a multiple of whatever the stripe width is... (that is,
the RAID set will always end on a complete stripe.)

> Here's the thought.  I noticed, when I was reading through the wedge
> code, that there's a reference to searching for backup gpt tables and
> that one of the backups is stored at the end of the media passed to
> the wedge discovery code.  Since the broken component is the last
> component in the raid set, I wonder if the wedge discovery code is
> marking the sectors containing the gpt table at the end of the raid
> set as protected, but for the disk itself, rather than the raid set?
> I want to say that this is only a theory at the moment, based on a
> quick diagnostic enhancement to the error messages, but I can't think
> of another reason why I'd be getting that error.  I'm going to be in
> and out of the office over the next week, but I'll try to capture the
> block numbers that are being written when the error occurs.  I think
> I can do that with a debug kernel I have built for the purpose.
> Again, this problem exists under 5.0, not just 5.1, so it predates
> Jed's changes.  If anyone has any other thoughts as to why I'd be
> getting EROFS on a raid component when trying to reconstruct to it,
> but not when I create the raid, I'm all ears.

So when one builds a regular filesystem on a wedge, does one end up
with the same problem with 'data' at the end of the wedge?  If one
does a dd to the wedge, does it report write errors before the end of
the wedge?

I really need to get my test box up-to-speed again, but that's going to
have to wait a few more weeks...

Later...

Greg Oster


> On Jan 7,  3:22pm, Brian Buhrow wrote:
> } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
> }     hello Greg.  Regarding problem 1, the inability to reconstruct
> } disks in raid sets with wedges in them, I confess I don't understand
> } the vnode stuff entirely, but rf_getdisksize() in rf_netbsdkintf.c
> } looks suspicious to me.  I'm a little unclear, but it looks like it
> } tries to get the disk size a number of ways, including by checking
> } for a possible wedge on the component.  I wonder if that's what's
> } sending the reference count too high?
> } -thanks
> } -Brian
> } 
> } On Jan 7,  2:17pm, Greg Oster wrote:
> } } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
> } } On Fri, 7 Jan 2011 05:34:11 -0800
> } } buhrow%lothlorien.nfbcal.org@localhost (Brian Buhrow) wrote:
> } } 
> } } >         hello.  OK.  Still more info.  There seem to be two
> } } > bugs here:
> } } > 
> } } > 1.  Raid sets with gpt partition tables in the raid set are not
> } } > able to reconstruct failed components because, for some reason,
> } } > the failed component is still marked open by the system even
> } } > after the raidframe code has marked it dead.  Still looking
> } } > into the fix for that one.
> } } 
> } } Is this just with autoconfig sets, or with non-autoconfig sets
> } } too?  When RF marks a disk as 'dead', it only does so internally,
> } } and doesn't write anything to the 'dead' disk.  It also doesn't
> } } even try to close the disk (maybe it should?).  Where it does try
> } } to close the disk is when you do a reconstruct-in-place -- there,
> } } it will close the disk before re-opening it...
> } } 
> } } rf_netbsdkintf.c:rf_close_component() should take care of closing
> } } a component, but does something Special need to be done for
> } } wedges there?
> } } 
> } } > 2.  Raid sets with gpt partition tables on them cannot be
> } } > unconfigured and reconfigured without rebooting.  This is
> } } > because dkwedge_delall() is not called during the raid shutdown
> } } > process.  I have a patch for this issue which seems to work
> } } > fine.  See the following output:
> } } [snip]
> } } > 
> } } > Here's the patch.  Note that this is against NetBSD-5.0
> } } > sources, but it should be clean for 5.1, and, i'm guessing,
> } } > -current as well.
> } } 
> } } Ah, good!  Thanks for your help with this.  I see Christos has
> } } already committed your changes too.  (Thanks, Christos!)
> } } 
> } } Later...
> } } 
> } } Greg Oster
> } >-- End of excerpt from Greg Oster
> } 
> } 
> >-- End of excerpt from Brian Buhrow
> 


Later...

Greg Oster

