tech-kern archive


Re: Problems with raidframe under NetBSD-5.1/i386



        hello.  I got sidetracked from this problem for a while, but I'm back
to looking at it as I have time.
        I think I may have been barking up the wrong tree with respect to the
problem I'm having reconstructing to raidframe disks with wedges on the
raid sets.  Adding a little extra info to the error messages yields:
raid2: initiating in-place reconstruction on column 4
raid2: Recon write failed (status 30 (0x1e))!
raid2: reconstruction failed.

        If that status number, taken from the second argument of
rf_ReconWriteDoneProc(), is an error from /usr/include/sys/errno.h, then
I'm getting EROFS when I try to reconstruct the disk.  Wouldn't that seem
to imply that raidframe is trying to write over some protected portion of
one of the components, probably the one I can't reconstruct to?
        Each of the components has a BSD disklabel on it, and I know that the
raid set actually begins 64 sectors from the start of the partition in
which the raid set resides.  However, is a similar set-back done at the
end of the raid?  That is, does the raid set extend all the way to the end
of its partition, or does it leave some unused space at the end as well?
        Here's the thought.  I noticed, when I was reading through the wedge
code, that there's a reference to searching for backup gpt tables, and that
one of the backups is stored at the end of the media passed to the wedge
discovery code.  Since the broken component is the last component in the
raid set, I wonder if the wedge discovery code is marking the sectors
containing the gpt table at the end of the raid set as protected, but for
the disk itself rather than the raid set.  This is only a theory at the
moment, based on a quick diagnostic enhancement to the error messages, but
I can't think of another reason why I'd be getting that error.
        I'm going to be in and out of the office over the next week, but I'll
try to see if I can capture the block numbers that are being written when
the error occurs.  I think I can do that with a debug kernel I have built
for the purpose.  Again, this problem exists under 5.0, not just 5.1, so
it predates Jed's changes.
        If anyone has any other thoughts as to why I'd be getting EROFS on a
raid component when trying to reconstruct to it, but not when I create the
raid, I'm all ears.

-thanks
-Brian
On Jan 7,  3:22pm, Brian Buhrow wrote:
} Subject: Re: Problems with raidframe under NetBSD-5.1/i386
}       hello Greg.  Regarding problem 1, the inability to reconstruct disks
} in raid sets with wedges in them, I confess I don't understand the vnode
} stuff entirely, but rf_getdisksize() in rf_netbsdkintf.c looks suspicious
} to me.  I'm a little unclear, but it looks like it tries to get the disk
} size a number of ways, including by checking for a possible wedge on the
} component.  I wonder if that's what's sending the reference count too high?
} -thanks
} -Brian
} 
} On Jan 7,  2:17pm, Greg Oster wrote:
} } Subject: Re: Problems with raidframe under NetBSD-5.1/i386
} } On Fri, 7 Jan 2011 05:34:11 -0800
} } buhrow%lothlorien.nfbcal.org@localhost (Brian Buhrow) wrote:
} } 
} } >   hello.  OK.  Still more info.  There seem to be two bugs here:
} } > 
} } > 1.  Raid sets with gpt partition tables in the raid set are not able
} } > to reconstruct failed components because, for some reason, the failed
} } > component is still marked open by the system even after the raidframe
} } > code has marked it dead.  Still looking into the fix for that one.
} } 
} } Is this just with autoconfig sets, or with non-autoconfig sets too?
} } When RF marks a disk as 'dead', it only does so internally, and doesn't
} } write anything to the 'dead' disk.  It also doesn't even try to close
} } the disk (maybe it should?).  Where it does try to close the disk is
} } when you do a reconstruct-in-place -- there, it will close the disk
} } before re-opening it... 
} } 
} } rf_netbsdkintf.c:rf_close_component() should take care of closing a
} } component, but does something Special need to be done for wedges there?
} } 
} } > 2.  Raid sets with gpt partition tables on them cannot be
} } > unconfigured and reconfigured without rebooting.  This is because
} } > dkwedge_delall() is not called during the raid shutdown process.  I
} } > have a patch for this issue which seems to work fine.  See the
} } > following output:
} } [snip]
} } > 
} } > Here's the patch.  Note that this is against NetBSD-5.0 sources, but
} } > it should be clean for 5.1, and, i'm guessing, -current as well.
} } 
} } Ah, good!  Thanks for your help with this.   I see Christos has already
} } committed your changes too. (Thanks, Christos!)
} } 
} } Later...
} } 
} } Greg Oster
} >-- End of excerpt from Greg Oster
} 
} 
>-- End of excerpt from Brian Buhrow



