netbsd-help: Re: RAIDFrame parity re-write ETA, 12 hours!?

Subject: Re: RAIDFrame parity re-write ETA, 12 hours!?
To: Greg Oster <oster@cs.usask.ca>
From: William Fletcher <wfletcher@omina.co.za>
List: netbsd-help
Date: 12/14/2006 19:47:16

Thanks a lot Greg for reading this _short_ chapter. I'll just get
a PCI SATA controller that I know is supported in 3.0, and keep my eyes
open for stuff like that in future.

I'm just forwarding this to the help list since it may help somebody
else along the line.

Thanks again.

On Thu, Dec 14, 2006 at 09:00:41AM -0600, Greg Oster wrote:
> William Fletcher writes:
> > Hi,
> >
> > I'm sorry to disturb you. But I'm wondering if you could help me.
> > Not sure if you're still maintaining the raidframe stuff in NetBSD, but ok.
>
> Yup, I am...(though lack of recent commits might indicate otherwise
> :-} )
>
> > Basically, is what I'm experiencing normal?
>=20
> I think so... at least, for the hardware in question...
>
> > I have a few machines using raidframe, all my machines are checklisted and
> > all of them are behaving perfectly, except for two new machines running
> > 3.0 (the rest run 2.*). I have a 3.1 machine that behaves normally with CGD
> > enabled on top of the raid partitions. No crashes yet. In fact, I'm e-mailing
> > you from the aforementioned 3.1 machine.
> >
> > This morning one of my clients phoned me due to a power cut they had last
> > night. The machine was switched on at 8AM and it is now 12PM. According to
> > raidctl -S raid0, the parity rewrite has 7 hours to go at this point.
>
> That does sound incredibly long, but....
>
> > At my office there is an identically laid out 3.0 machine. I immediately
> > pulled the power cable out to see if it would do the same. About two weeks
> > ago I had accidentally brushed past the power button with my chair and the
> > machine switched off, and it took about 10 hours to rewrite the parity.
> > I assumed this was a once-off thing, was very busy at the time and didn't
> > pay much attention to this pending crisis.
> >
> > After power off and on about 30 minutes ago, raidctl -S raid0 informs me
> > I have 11 hours of parity rewriting to go. The machine is barely usable on
> > the command line, let alone samba services, e-mail, etc.
>
> Ug....
>
> > I find this strange because I have a Pentium 3 (much less powerful) with the
> > same size drives and a much more pathetic IDE controller and it doesn't even
> > worry about rewriting the parity, it can operate normally while it rewrites.
> > These machines are a lot more beefy.
>
> Let me guess... your P3 only has a pair of disks?
>
> > The exact commands I used to build the raid (my ugly checklist system, sorry):
> [snip -- looks normal]
>
> >
> > Hardware in both machines is exactly identical:
> > NetBSD 3.0.1_STABLE (SWORDFISH) #0: Tue Nov 21 16:23:50 SAST 2006
> >         ultraviolet@swordfish.omina.co.za:/usr/src/sys/arch/i386/compile/SWORDFISH
> > total memory = 1023 MB
> > avail memory = 999 MB
> > BIOS32 rev. 0 found at 0xf0010
> > mainbus0 (root)
> > cpu0 at mainbus0: (uniprocessor)
> > cpu0: AMD Unknown K7 (Athlon) (686-class), 2199.91 MHz, id 0x40ff2
> > cpu0: features 78bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
> > cpu0: features 78bfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,MMX>
> > cpu0: features 78bfbff<FXSR,SSE,SSE2>
> > cpu0: features2 2001<SSE3>
> > cpu0: "AMD Athlon(tm) 64 Processor 3500+"
> > pci0 at mainbus0 bus 0: configuration mode 1
> > pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
> [snip]
> > pciide0 at pci0 dev 15 function 0
> > pciide0: vendor 0x1106 product 0x0591 (rev. 0x80)
> > pciide0: bus-master DMA support present, but unused (no driver support)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> There is one issue...
>=20
> > pciide0: primary channel configured to native-PCI mode
> > pciide0: using irq 10 for native-PCI interrupt
> > atabus0 at pciide0 channel 0
> > pciide0: secondary channel configured to native-PCI mode
> > atabus1 at pciide0 channel 1
> > pciide1 at pci0 dev 15 function 1
> > pciide1: vendor 0x1106 product 0x0571 (rev. 0x07)
> > pciide1: bus-master DMA support present, but unused (no driver support)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> ditto for here...
>
> > pciide1: primary channel configured to compatibility mode
> > pciide1: primary channel interrupting at irq 14
> > atabus2 at pciide1 channel 0
> > pciide1: secondary channel configured to compatibility mode
> > pciide1: secondary channel ignored (not responding; disabled or no drives?)
> [snip]
> > Kernelized RAIDframe activated
> > wd0 at atabus0 drive 0: <HDS728080PLA380>
> > wd0: drive supports 16-sector PIO transfers, LBA48 addressing
> > wd0: 78533 MB, 159560 cyl, 16 head, 63 sec, 512 bytes/sect x 160836480 sectors
> > wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
> > wd1 at atabus1 drive 0: <HDS728080PLA380>
> > wd1: drive supports 16-sector PIO transfers, LBA48 addressing
> > wd1: 78533 MB, 159560 cyl, 16 head, 63 sec, 512 bytes/sect x 160836480 sectors
> > wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
> > wd2 at atabus2 drive 1: <ST3160812A>
> > wd2: drive supports 16-sector PIO transfers, LBA48 addressing
> > wd2: 149 GB, 310101 cyl, 16 head, 63 sec, 512 bytes/sect x 312581808 sectors
> > wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
> > raid0: RAID Level 1
> > raid0: Components: /dev/wd0a /dev/wd1a
> > raid0: Total Sectors: 160836352 (78533 MB)
> > boot device: raid0
> > root on raid0a dumps on raid0b
> > root file system type: ffs
> [snip]
>
> So.... It looks like for this machine that the driver for the pciide
> doesn't do DMA... and that's really going to slow disk IO down.. :(
> For a RAID 1 "parity rewrite", what it's basically doing is reading
> both disks and then comparing the bits.  The usual case is that the
> bits are the same, in which case it goes on to read the next bits.
> In the event they are different, then it needs to basically pick one
> that will be the "master", and make the "slave" match the "master".
> So, in general, it's the raw read speed of the disks which is going
> to be the performance bottleneck on a parity check.  My guess is that
> your P3 machine doesn't have the same pciide controller, and is actually
> operating in full DMA mode.  Those 73GB disks should normally get
> checked at about 20-50MB/sec (depends on the disk), which should take
> at most about an hour....  The times you're indicating mean the
> disks are only doing about 1.8MB/sec -- a far cry from what they
> should be doing...
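
A quick back-of-the-envelope check of those figures, as a rough sketch only. It assumes the 78533 MB component size from the dmesg above (rather than the rounded 73GB), one full sequential read of the component, and the roughly 12-hour total from the subject line. The function names are made up for illustration.

```python
# Rough check of the parity-rewrite timings discussed above.
# Assumptions: 78533 MB per component (from the dmesg output), one full
# sequential read of the component, and a ~12 hour total rewrite time.

COMPONENT_MB = 78533


def rewrite_minutes(size_mb, rate_mb_per_sec):
    """Minutes needed to read size_mb at a sustained rate."""
    return size_mb / rate_mb_per_sec / 60


def implied_rate(size_mb, hours):
    """Sustained MB/sec implied by finishing size_mb in the given hours."""
    return size_mb / (hours * 3600)


if __name__ == "__main__":
    for rate in (20, 50):
        print(f"{rate} MB/sec -> {rewrite_minutes(COMPONENT_MB, rate):.0f} minutes")
    print(f"12 hours -> {implied_rate(COMPONENT_MB, 12):.2f} MB/sec")
```

That works out to roughly 65 and 26 minutes with DMA working, and about 1.8 MB/sec for a 12-hour rewrite, which is in line with the numbers quoted.
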
>
> On the plus side, it looks like 4.0 will have better support for the
> ide controller:
>
>  pciide0: vendor 0x1106 product 0x0591 (rev. 0x80)
>
> 1.842        (bouyer   25-Oct-06): #define      PCI_PRODUCT_VIATECH_VT8237A_SATA        0x0591          /* VT8237A Integrated SATA Controller */
>
> (it came in on Oct 26).  I don't know how easy/hard it would be to
> pullup support for this to netbsd-3.  Manuel Bouyer would be the
> person to ask about that...
>
> I hope this helps some...
>
> Later...
>
> Greg Oster
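
For anyone following along, the RAID 1 "parity rewrite" Greg describes above (read both components, compare, and make the "slave" match the "master" where they differ) can be sketched in a few lines. This is only a toy model over in-memory byte buffers, not RAIDframe's actual code. The function name, the stripe size, and treating the first argument as the master are all assumptions for illustration.

```python
def rewrite_mirror(master, slave, stripe_size=64):
    """Make slave match master stripe by stripe; return stripes rewritten.

    Toy model of a RAID 1 parity rewrite: every stripe of both components
    is read and compared, and only mismatching stripes are written.
    """
    if len(master) != len(slave):
        raise ValueError("components must be the same size")
    rewritten = 0
    for start in range(0, len(master), stripe_size):
        end = start + stripe_size
        # The common case: both copies agree, so just move on.
        if master[start:end] != slave[start:end]:
            slave[start:end] = master[start:end]
            rewritten += 1
    return rewritten


if __name__ == "__main__":
    good = bytearray(b"A" * 256)
    stale = bytearray(good)
    stale[70] = ord("B")  # simulate a write that only reached one disk
    print("stripes rewritten:", rewrite_mirror(good, stale))
    print("in sync:", good == stale)
```

Even when nothing differs, every stripe still has to be read from both disks, which is why the raw read speed ends up being the bottleneck.
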

-- 
Omina Solutions  | http://omina.co.za | (012) Ph. 664-2480 F. 664-2474

