Re: NetBSD Raid5, slow write speeds, using big disks?!

To: Taylor R Campbell <riastradh%NetBSD.org@localhost>
Subject: Re: NetBSD Raid5, slow write speeds, using big disks?!
From: Robert Elz <kre%munnari.OZ.AU@localhost>
Date: Tue, 23 Jun 2026 19:02:08 +0700

    Date:        Mon, 22 Jun 2026 23:53:56 +0000
    From:        Taylor R Campbell <riastradh%NetBSD.org@localhost>
    Message-ID:  <20260622235357.9295984D2E%mail.netbsd.org@localhost>

  | > # gpt add -t raid -l raid5@wd0 -b $(( 2048 )) -s 15628051053 wd0 ****
  | What was the alignment of this partition?

Really?   -b 2048   (explicitly setting the starting block - though I have
no idea what the shell arith was supposed to be accomplishing there).

That's @1MB which should be aligned enough.
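
For anyone wanting to check that arithmetic, a minimal sketch follows. It assumes that -b counts 512-byte sectors; the function names are made up for illustration and are not part of gpt(8).

```shell
#!/bin/sh
# Assumption: the -b argument to "gpt add" is counted in 512-byte sectors.
SECTOR_BYTES=512

# Byte offset of a partition that starts at sector $1.
start_offset_bytes() {
    echo $(( $1 * SECTOR_BYTES ))
}

# Prints "yes" if byte offset $1 is a multiple of alignment $2 (in bytes).
is_aligned() {
    if [ $(( $1 % $2 )) -eq 0 ]; then echo yes; else echo no; fi
}

off=$(start_offset_bytes 2048)
echo "start offset: $off bytes"
echo "4 KiB aligned:  $(is_aligned "$off" 4096)"
echo "16 KiB aligned: $(is_aligned "$off" 16384)"
echo "1 MiB aligned:  $(is_aligned "$off" 1048576)"
```

Under that assumption sector 2048 is byte 1048576, which is a multiple of 4 KiB, of a 16 KiB stripe unit, and of 1 MiB.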

The problem was almost certainly having a 3+1 raid5 with 3x16KB data
stripes (but it could be compounded by poor filesystem alignment in
the raid set).

And FWIW, I run raid5 (two different raidsets) on big discs (the raidsets
are ~ twice as big as that shown here, and can take almost 2 weeks to
init or reconstruct) with entirely acceptable write performance: one,
using wd0 drives, typically writes at about 100MB/s; the other, on external
USB drives (in one case, connected via one cable, so shared bandwidth)
typically runs about 50MB/s (sometimes between 60 & 70) - those are
sustained write rates, not the initial "write into the UVM" (buffer cache)
startup peaks.

Both sets are 2+1, 2x16KB stripes, and 32KB or 64KB (I forget) file
system blocks (they may be one each way).   One is about 36TB, the other
around 47TB (TiB in each case).   7TB drives are not big (or not any
more, they'd be approaching being called small these days).
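
A rough sketch of why the two layouts might behave differently, offered as one possible reading rather than a diagnosis: it computes the data width of a full stripe for 2+1 and 3+1 sets with 16KB per component, and checks whether that width is a power of two (file system block sizes are). The function names are hypothetical.

```shell
#!/bin/sh
# Data held by one full stripe, in KB: data components ($1) times the
# per-component stripe unit in KB ($2).  Parity is not counted.
full_stripe_kb() {
    echo $(( $1 * $2 ))
}

# Prints "yes" if $1 is a power of two, "no" otherwise.
is_pow2() {
    n=$1
    if [ "$n" -gt 0 ] && [ $(( n & (n - 1) )) -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

for data in 2 3; do
    w=$(full_stripe_kb "$data" 16)
    echo "${data}+1 with 16KB units: full stripe ${w}KB, power of two: $(is_pow2 "$w")"
done
```

With 2 data components the full stripe is 32KB, which a 32KB file system block can cover exactly. With 3 it is 48KB, and no power-of-two block size is a multiple of 48KB, so a single block write can never cover exactly a whole number of stripes.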

  | Does it make a difference if you use `-a 4096' or `-a 1m' with `gpt
  | add'?

Those would truncate a few blocks off the (uselessly odd numbered) size
but not affect performance.
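
To put a number on "a few blocks", here is a sketch of the rounding, again assuming 512-byte sectors and assuming the size is simply rounded down to a multiple of the alignment (whether gpt(8) does exactly this is not verified here). The function name is made up.

```shell
#!/bin/sh
# Assumption: sizes are counted in 512-byte sectors.
SECTOR_BYTES=512

# Round a sector count ($1) down to a multiple of an alignment in bytes ($2).
round_down_sectors() {
    align_sectors=$(( $2 / SECTOR_BYTES ))
    echo $(( $1 / align_sectors * align_sectors ))
}

size=15628051053    # the -s value from the quoted command
for align in 4096 1048576; do
    r=$(round_down_sectors "$size" "$align")
    echo "align $align bytes: $r sectors ($(( size - r )) dropped)"
done
```

Under those assumptions 4096-byte alignment drops 5 sectors and 1MB alignment drops 621, which is noise on a partition of this size.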

  | zfs will warn if you ...

Not a bad thing to do, and

  | However, I don't think raidframe detects and warns about this case.

It doesn't, but nor does it necessarily know.   I run another raid (a raid1)
where one of the components is on a cgd (currently: I thought one of its
drives had died, so took 2 smaller ones, joined them together with cgd
- which is MUCH faster than raid0 - and use that as the 2nd half of the
raid1 ... since then I believe the issue might be the controller, not
the drive after all, so that drive might come back, connected differently).
The point there is that I don't think cgd will pass through the underlying
geometry to its client, or really can, given there's no requirement that
the drives making up the cgd set all have the same properties.

  | And perhaps gpt(8) should also use the disk's native alignment as the
  | default alignment instead of 512 bytes.

But having gpt warn, when it is able (it needs to work on cgd as well...),
that a partition is badly aligned is probably useful (and setting the default
as you indicate is probably also useful).   When used on a raidframe, it
could warn about filesystems not aligned on stripes (raidframe should probably
give the stripe size as the "native alignment" - while still using 512 bytes
as the addressing factor.)

Another alignment with raidframe (and the partitions made in the raid array)
that is important is to make sure the file system blocks correctly align with
the raid stripes - when I was first setting up the first of those raid5's
above (in its initial incarnation - it has grown into bigger drives since
then) I didn't think of that one, and nicely made the file system blocks
exactly the same size as the raid stripes (32KB above), but then (to try
and steal an extra few blocks), started the filesystem 16KB into the raid
array (easily big enough to avoid 4K underlying sector sizes being an issue).

But then every (and I mean every) file system block write straddled 2 raid
stripes, meaning 2 RMW cycles for every one of them.   That version crawled.
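
The straddling is easy to see with a little arithmetic. The sketch below counts how many 32KB stripes each 32KB file system block touches when the filesystem starts at offset 0 and when it starts 16KB into the raid; the function name is hypothetical, and offsets are in bytes.

```shell
#!/bin/sh
# Number of stripes touched by a write of $2 bytes at byte offset $1,
# where each full stripe holds $3 bytes of data.
stripes_touched() {
    first=$(( $1 / $3 ))
    last=$(( ($1 + $2 - 1) / $3 ))
    echo $(( last - first + 1 ))
}

KB=1024
STRIPE=$(( 32 * KB ))   # 2 data components x 16KB
BLOCK=$(( 32 * KB ))    # file system block size

for fs_start in 0 $(( 16 * KB )); do
    for n in 0 1 2; do
        off=$(( fs_start + n * BLOCK ))
        echo "fs start $fs_start: block $n touches $(stripes_touched "$off" "$BLOCK" "$STRIPE") stripe(s)"
    done
done
```

With the filesystem at offset 0 every block maps onto exactly one stripe; with the 16KB offset every block covers the second half of one stripe and the first half of the next.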

And whatever one believes of LLM messages to the list, what was in the one
in question (apart from the questionable use of TeX fragments, for no reason)
was mostly accurate.   Its recommendation not to use WAPBL is reasonable, but
not because of small writes (those are just metadata writes, and would exist
for non-WAPBL filesystems as well - except more of them), as big filesystems
like this mostly (when writing) are writing big files, and performance on
small ones (including directories) mostly doesn't matter.   Rather, WAPBL
can be intolerably slow for very large filesystems when mounting (flushing
the log) - a fsck of the filesystem can be faster - and WAPBL's speedups
tend not to matter as much when you're largely writing large files, then
mostly only ever reading them again afterwards.

kre

ps: I have no idea whether performance would be better in a 3+1 raid5
by using a stripe size/component that matches the file system block size.
Doesn't sound quite right to me, but I have never tried that.   I'd be
using 2+1 + 1 drive as hot spare, or 2xraid1 (with cgd under it to make
2 bigger "drives" to run the raid on top of, rather than 2xraid1 with
either cgd or raid0 on top, though the differences are probably mostly
illusory).   Either of those provide less space than the 3+1 (obviously).
The 2+1 + spare is probably best for stability, as if a drive dies,
raidframe should just switch to using the spare immediately, so providing
none of the rest die before that has reconstructed, less window for 2
drive death killing the raidset completely.
