Port-i386 archive


Re: Problem with raidframe under NetBSD-3 and NetBSD-4



Brian Buhrow writes:
>       Hello Greg.  I have two machines with mirrored 1TB disks installed.
> Here is the output from dmesg concerning the raid on the working machine.
> Note that I've not actually tried rebuilding a replacement disk on this
> machine; I just built things with the disks initially.

Ahh.. so that means a reconstruct has not been tried on the other 
machine's disks, and that the parity check was the only code 
exercised there...

>       The raid was built under NetBSD-3.1, but is now running 4.0-stable.
> 
> Good machine:
> raid0: RAID Level 1
> raid0: Components: /dev/wd0a /dev/wd1a
> raid0: Total Sectors: 1953524992 (953869 MB)
> boot device: raid0
> root on raid0a dumps on raid0b
> root file system type: ffs
> raid0: Device already configured!
> 
>       Now, contrast that with the broken machine:
> 
> raid0: RAID Level 1
> raid0: Components: component0[**FAILED**] /dev/wd0a
> raid0: Total Sectors: 1953524992 (953869 MB)
> boot device: raid0
> root on raid0a dumps on raid0b
> root file system type: ffs
>       Unfortunately, I don't have what I think you want to see.  That is, I
> don't have the messages that came from the kernel after I initiated the
> copyback but before I rebooted the machine.  

I think we're going to need that info (plus possibly additional 
debugging info) to really figure out what's going on....

> You're correct that the
> machine doesn't reboot, but begins acting strangely, as all the processes
> line up to get stuff off the disks, which no longer return data.
> Eventually, all processes end up in disk wait.
>       I was wondering about the size of these individual disks, and
> wondering if I was running into some size limitation, since that seems
> like what's happening: as I read the source, the reconstruction starts,
> but then fails silently for some reason.  Or, at least, I can't see the
> failure message, because dmesg stops returning messages shortly after
> this process begins.  Worse, after the failure occurs, the raid system
> fails to unlock the raid set, meaning no other processes can get to it.

Well.. the RAID set has to suspend new requests and wait till 
outstanding IOs complete before beginning the reconstruction... but 
that shouldn't take very long... 
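
In rough terms, that quiesce step looks something like the userland
sketch below (this is not RAIDframe code -- the function and variable
names are invented for illustration): new requests are held off while a
flag is set, and reconstruction only begins once the count of
outstanding I/Os drains to zero.  If that flag were never cleared again
-- say, because the reconstruction bailed out on an error -- everything
would back up behind it, which would look a lot like the
all-processes-in-disk-wait symptom you describe.

/*
 * Minimal userland sketch of "suspend new I/O, drain outstanding I/O,
 * then reconstruct".  Not RAIDframe code; all names are made up.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  drained = PTHREAD_COND_INITIALIZER;
static int outstanding_ios = 0;     /* I/Os currently in flight */
static int suspended = 0;           /* set while reconstruction runs */

static int
io_start(void)
{
    pthread_mutex_lock(&lock);
    if (suspended) {                /* new requests are refused here */
        pthread_mutex_unlock(&lock);
        return -1;
    }
    outstanding_ios++;
    pthread_mutex_unlock(&lock);
    return 0;
}

static void
io_done(void)
{
    pthread_mutex_lock(&lock);
    if (--outstanding_ios == 0)
        pthread_cond_signal(&drained);
    pthread_mutex_unlock(&lock);
}

static void
reconstruct(void)
{
    pthread_mutex_lock(&lock);
    suspended = 1;
    while (outstanding_ios > 0)     /* wait for in-flight I/O to finish */
        pthread_cond_wait(&drained, &lock);
    pthread_mutex_unlock(&lock);

    printf("set quiesced; reconstruction would run here\n");

    pthread_mutex_lock(&lock);
    suspended = 0;                  /* resume normal I/O afterwards */
    pthread_mutex_unlock(&lock);
}

int
main(void)
{
    if (io_start() == 0)
        io_done();
    reconstruct();
    return 0;
}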

> Unless I'm
> mistaken (a very likely possibility, since I'm not familiar with this code),
> it looks to me like there are several places in the raidioctl() function in
> rf_netbsdkintf.c where a malloc occurs and, if the malloc fails, an error is
> returned without unlocking the raid mutex.

Do you have line numbers and a revision of rf_netbsdkintf.c for that? 
I'm not seeing any instances where that would be true... (at least 
not in -current or 4.0)  (Note that it only needs to unlock the 
raidPtr->mutex if it actually locked it -- and the mutex doesn't get 
locked unless it needs to be...)
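
Just to make sure we're talking about the same pattern, here is a
hypothetical illustration (this is not the actual raidioctl() code, and
the function names are invented) of the bug class you describe -- take
the lock, fail a malloc, and return the error without dropping the lock
-- next to the variant that unlocks on every exit path.  As noted above,
the real code only needs the unlock on paths that actually took the lock.

#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t raid_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Suspected pattern: the early return would leak the lock. */
static int
ioctl_leaky(size_t len)
{
    void *buf;

    pthread_mutex_lock(&raid_mutex);
    buf = malloc(len);
    if (buf == NULL)
        return ENOMEM;  /* BUG: returns with raid_mutex still held */
    /* ... do the real work ... */
    free(buf);
    pthread_mutex_unlock(&raid_mutex);
    return 0;
}

/* Safe pattern: every exit path releases the lock it took. */
static int
ioctl_safe(size_t len)
{
    void *buf;

    pthread_mutex_lock(&raid_mutex);
    buf = malloc(len);
    if (buf == NULL) {
        pthread_mutex_unlock(&raid_mutex);
        return ENOMEM;
    }
    /* ... do the real work ... */
    free(buf);
    pthread_mutex_unlock(&raid_mutex);
    return 0;
}

int
main(void)
{
    (void)ioctl_leaky(64);
    return ioctl_safe(64);
}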

> If that's true in
> raidioctl, I wonder if it could be true in other places in the code?
>       Another question, has anyone else set up raid sets with disks this
> large?

Not me... I've got plenty of dual 320's, but nothing larger than that.
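
For reference, a quick back-of-the-envelope check on the size question,
assuming the usual 512-byte sectors:

    1953524992 sectors * 512 bytes = 1,000,204,795,904 bytes  (~931 GiB)
    2^31 sectors * 512 bytes       = 1 TiB

so the set is still (just) under what a signed 32-bit sector count can
address, for whatever that's worth.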
 
>       Also, here is the disk label for the disks in the raid set in
> question.


Later...

Greg Oster



