NetBSD-Users archive


Re: Beating a dead horse



    Date:        Tue, 24 Nov 2015 21:57:50 -0553.75
    From:        "William A. Mahaffey III" <wam%hiwaay.net@localhost>
    Message-ID:  <56553074.9060304%hiwaay.net@localhost>


  | 4256EE1 # time dd if=/dev/zero of=/home/testfile bs=16k count=32768
  | 32768+0 records in
  | 32768+0 records out
  | 536870912 bytes transferred in 22.475 secs (23887471 bytes/sec)
  |         23.28 real         0.10 user         2.38 sys
  | 4256EE1 #
  | 
  | i.e. about 24 MB/s.

I think I'd be happy enough with that; maybe it can be improved a little.

  | When I zero-out parts of these drive to reinitialize 
  | them, I see ~120 MB/s for one drive.

Depending upon just how big those "parts" are, that number might be
an illusion.    You need to be writing at least about as much as
you did in the test above to reduce the effects of write-behind
(caching in the drive) etc.   Normally a "zero to reinit" write
doesn't need nearly that much (often just a few MB) - a write that
small goes entirely to the drive's cache, so measuring its speed is
just measuring the DMA rate, which is useless for anything.
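
If you want to redo that raw measurement in a way the drive's cache can't
flatter, write a comparable amount straight to the raw device - but only
on a drive with nothing on it you care about (eg: the one you were zeroing
anyway).  Something like this (wd5 and the whole-disk 'd' partition, as on
i386/amd64, are just my guesses for your setup - adjust to suit):

	time dd if=/dev/zero of=/dev/rwd5d bs=64k count=32768

That's 2GB, enough that write-behind and the on-drive cache stop mattering,
and dd will report the sustained rate when it finishes.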


  | RAID5 stripes I/O onto the data 
  | drives, so I expect ~4X I/O speed w/ 4 data drives. With various 
  | overheads/inefficiencies, I (think I) expect 350-400 MB/s writes.

That's not going to happen.   Every raid write (whatever raid level, except 0)
requires 2 parallel disc writes (at least) - you need that to get the
redundancy that is the R in the name - it can also require reads.

For raid 5, you write to the data drive (one of the 4 of them) and to the
parity drive - that is, every write ends up including a write to the parity
drive, so the upper limit on speed for a contiguous write is that of one
drive (a bit less, probably, depending upon which controllers are in use,
as the data still needs to be transmitted twice, and each controller can
only be transferring for one drive at a time ... at least for the kinds of
disc controllers in consumer grade equipment).   If both data and parity
happen to be using the same controller, the max rate will certainly be
less (measurably less, though perhaps not dramatically) than what you can
achieve with one drive.   If they're on different controllers, then in
ideal circumstances you might get close to the rate you can expect from
one drive.

For general purpose I/O (writes all over the filesystem, as you'd see in
normal operations) that's mitigated by there not really being one parity
drive, rather, all 5 drives (the 4 you think of as being the data drives,
and the one you think of as being the parity drive) perform as both data
and parity drives, for different segments of the raid, so there isn't
really (in normal operation) a one drive bottleneck -- but with 5
drives, and 2 writes needed for each I/O, the best you could possibly
do is twice as fast as a single drive in overall throughput.   In practice
you'd never see that, however: real workloads just aren't going to be spread
out that conveniently across just the right parts of the filesystem.   If
you ever even approach what a single drive can achieve, I'd be surprised.
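
(A rough back-of-the-envelope, following the 2-physical-writes-per-write
reasoning above, and taking your ~120 MB/s raw number at face value for
the moment:

	one drive, raw, best case:      ~120 MB/s
	whole raid, absolute ceiling:   ~2 x 120 = ~240 MB/s
	    (5 spindles, but every logical write costs 2 physical writes)
	in practice (filesystem, controllers, seeks, real workloads):
	    well under 120 MB/s

so the 350-400 MB/s you were hoping for isn't attainable even in theory.)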

Now, in the above (aside from the possible measurement error in your 120MB/s)
I've been allowing you to think that's "what a single drive can achieve".
It isn't.  That's raw I/O onto the drive, and will run at the best possible
speed that the drive can handle - everything is optimised for that case, as
it is one of the (meaningless) standard benchmarks.

For real use, there are also filesystem overheads to consider: your raid
test was onto a file on the raid, not onto the raw raid (though I wouldn't
expect the raw raid to be all that much faster - certainly not more than
about 60MB/s, assuming the 120 MB/s is correct).

To get a more valid baseline, what you can actually expect to observe,
you need to be comparing apples to apples - that is, the one drive test
needs to also have a filesystem, and you need to be writing a file to it.

To test that, take your hot spare (ie: unused) drive, and build a ffs on
it instead of raid (unfortunately, I think you need to reboot to get it
out of being a raidframe spare first - as I recall, raidframe has no
"stop being a spare" operation ... it should have, but ...).   Then just
don't add it back as a hot spare after the reboot (assuming you are
actually adding it now - spares don't get autoconfigured).   (Wait till
down below to see how to run raidctl -s properly; its output will tell
you whether the hot spare is actually configured or not.)

Then build a ffs on the spare drive (after it is no longer a spare)
(you'll need to change the partition type from raid to ffs in the label
first - probably using disklabel, though it could be gpt)

Set up the ffs with the same parameters as the filesystem on
your raid (ie: -b 32768 -f 4096), mount that, copy a bunch of files
onto it (make its percentage full about the same as whatever is on the
raid filesystem you're testing - that's about 37% full, from your df
output later), and then try a dd like the one above onto it and see how
fast that goes.   I can promise you it won't be anywhere near 120MB/s...

The filesystem needs data on it (not just empty) so that the block allocation
strategy works about the same - writing to an empty filesystem would make
it too simple to always pick the best block for the next write, and so
make the speed seem faster than what is reasonable in real life.
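
In concrete terms, the whole test is something like the following - a
sketch only, where wd5, the 'a' partition, /mnt and the path being copied
are all just placeholders for whatever your spare drive and data really are:

	disklabel -e wd5        # change partition a's fstype from RAID to 4.2BSD
	newfs -O 2 -b 32768 -f 4096 /dev/rwd5a
	mount /dev/wd5a /mnt
	cp -R /home/whatever /mnt       # repeat until /mnt is ~37% full
	time dd if=/dev/zero of=/mnt/testfile bs=16k count=32768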

Once you've done that and obtained the results, you can simply unmount the
dummy filesystem, change the partition type back to raid in the label,
and add it back as a raidframe spare (no need to reboot or anything
to go that direction, and no need to do anything to the drive other than
change the partition label type).
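
Putting it back is just the reverse (again with my assumed names; raid2
here is only a guess at which raid set /home really lives on - use whatever
raidctl device the steps below identify):

	umount /mnt
	disklabel -e wd5                # set partition a's fstype back to RAID
	raidctl -a /dev/wd5a raid2      # re-add it as a hot spare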

Once you've done that, you can properly compare the raid filesystem
performance and the single disc filesystem performance, and unless the
raid performance is less than half what you get from the single disc
filesystem, I'd just say "OK, that's good" and be done.

If the single disc filesystem is hugely faster than that 24MB/sec
(say 60MB/sec or faster), which I kind of doubt it will be, then perhaps
you should look at tuning the raid or filesystem params.

Until then, leave it alone.

  | I posted a variation of this question a while back, w/ larger amount of 
  | I/O, & someone else replied that they tried the same command & saw ~20X 
  | faster I/O than mine reported.

There are too many variables that could cause that kind of thing: different
drive types, filesystem params, ...

A question to ask yourself is just what you plan on doing that is going
to need more than 24MB/sec sustained write throughput ?

Unless your application is something like video editing, which produces
lots of data very quickly (and if it is, raid5 is absolutely not what you
should be using ... use raid10 instead ... you'll get less space, but much
faster writes (and faster reads)).

For most normal software development however, you'll never come close to
that; when my systems are ultra busy, I see more like 4-5 MB/sec sustained
in overall I/O (in and out combined), with just occasional bursts above that.


  | ffs data from dumpfs for that FS (RAID5 mounted as /home):
  |
  | 4256EE1 # cat dumpfs.OUTPUT.head.txt
  | file system: /dev/rdk0
  | format  FFSv2
  | endian  little-endian
  | location 65536  (-b 128)

That looks good.

  | bsize   32768   shift   15      mask    0xffff8000
  | fsize   4096    shift   12      mask    0xfffff000

And those look to be appropriate.   I deleted all the rest, none of it
is immediately relevant.


  | 4256EE1 # raidctl -s dk0
  | raidctl: ioctl (RAIDFRAME_GET_INFO) failed: Inappropriate ioctl for device

That makes no sense to ask for - it isn't supposed to work...

  | 4256EE1 # raidctl -s raid0a
  | Components:
  |             /dev/wd0a: optimal
  |             /dev/wd1a: optimal

This is a RAID level 1 (mirror), and obviously isn't the raid you're
talking about.

(And normally you wouldn't give "raid0a" there, just "raid0"; I'm
actually a little surprised that raid0a worked.  If you wanted to
be more explicit, I'd have expected /dev/raid0d to be the device
name it really wants.)

There must be a raid1a (or something) that has /home mounted on it,
right?  Try raidctl -s raid1   [but not that - see below, probably raid2;
I was replying without reading to the end first ... stupid me!]

(and yes, I know it is confusing that "raid1" sometimes means "RAID Level 1"
and sometimes means "the second raidframe container (disk-like thing)").
In commands like raidctl it is always the second, though...

  | 4256EE1 # df -h
  | Filesystem         Size       Used      Avail %Cap Mounted on
  | /dev/raid0a         16G       210M        15G   1% /
  | /dev/raid1a         63G       1.1G        59G   1% /usr
  | /dev/dk0           3.5T       1.2T       2.1T  37% /home
  | kernfs             1.0K       1.0K         0B 100% /kern
  | ptyfs              1.0K       1.0K         0B 100% /dev/pts
  | procfs             4.0K       4.0K         0B 100% /proc
  | tmpfs              8.0G       4.0K       8.0G   0% /tmp

Oh, I see where the confusion comes from: dk0 is a wedge, probably on raid2.

Do
	sysctl hw.disknames
which will list all of the "disk type" devices in the system.
That will include wd0 ... wd5, raid0, raid1, and (I expect) raid2
(as well as perhaps a bunch of dkN wedge things).
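
Its output is a single line, something like this (illustrative only - your
names will differ):

	hw.disknames = wd0 wd1 wd2 wd3 wd4 wd5 raid0 raid1 raid2 dk0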

You want the raidctl -s output from the raidN that is not raid0 or raid1

You can also look in /var/log/messages (or /var/run/dmesg.boot) and see
the boot time message that will tell you where dk0 comes from, or do

	dkctl dk0 getwedgeinfo

which will print something like

dk0 at sd0: Passport_EFI
dk0: 262144 blocks at 512, type: msdos

except in your case the "sd0" will be "raidN", the Passport_EFI will be
whatever you called the wedge (its label) when it was created, and the
sizes and filesystem types will obviously be different.  The "raidN" part
is all that matters here.

The raidN is what you want for raidctl -s
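
ie: most likely just

	raidctl -s raid2

(substituting whatever raidN the wedge info actually shows).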

  | Because of its size (> 2 TB) it was setup using dkctl & raidframe won't 
  | report anything about it, how can I get that info for you ? Thanks & TIA.

See above...

Once you have all the relevant numbers, it will probably take a raidframe
and filesystem expert to tell you whether your layout is optimal or not.
I am neither.

kre


