Sorry that I'm chiming in a bit late at this point, after Ian
Clark already pointed out what is most likely the culprit...
In your initial config:
START layout
64 1 1 5
this says that the stripe unit is 64 blocks on each component... With 2
data stripe units and 1 parity stripe unit in each stripe, that gives you
a total of 128 blocks of data in a stripe.
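
To make the arithmetic above concrete, here is a rough sketch. The variable names are mine, and the 512-byte block size is an assumption (it is what makes 128 blocks come out to 64K); the 64 and the 2 data components come from your config.

```python
# Sketch only: names are illustrative, not RAIDframe identifiers.
SECT_PER_SU = 64        # first field of the "START layout" line
DATA_COMPONENTS = 2     # 3 components in RAID 5 = 2 data + 1 parity per stripe
BLOCK_BYTES = 512       # assumed block (sector) size

def stripe_data_blocks(sect_per_su, data_components):
    """Blocks of user data held by one full stripe (parity excluded)."""
    return sect_per_su * data_components

blocks = stripe_data_blocks(SECT_PER_SU, DATA_COMPONENTS)
kib = blocks * BLOCK_BYTES // 1024
print(blocks, "blocks of data per stripe =", kib, "KiB")
```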
When you did this:
# gpt show raid0
...
64 1953546015 1 GPT part - NetBSD UFS/UFS2
You basically aligned the partition on the half-stripe, which, I
believe, leaves a whole bunch of the filesystem aligned on
half-stripes. E.g. every 64K write you do ends up straddling two stripes,
so each one pays the read-modify-write small-write penalty.
If you re-align that partition to be 128 blocks from the start of the
RAID set, it should perform significantly better.
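
If you want to sanity-check an offset before repartitioning, something like the following sketch works. It assumes the 128-block data stripe from above and that the filesystem issues 64K (128-block) writes at multiples of 128 blocks from the start of the partition; the function names are made up for illustration.

```python
# Sketch only: helper names are hypothetical, offsets are in blocks.
STRIPE_BLOCKS = 128  # data blocks per stripe (64 per stripe unit x 2 data units)

def stripes_touched(part_start, offset_in_part, length, stripe_blocks=STRIPE_BLOCKS):
    """How many stripes a write of `length` blocks spans on the RAID set."""
    first = (part_start + offset_in_part) // stripe_blocks
    last = (part_start + offset_in_part + length - 1) // stripe_blocks
    return last - first + 1

def align_up(block, stripe_blocks=STRIPE_BLOCKS):
    """Round a starting block up to the next stripe boundary."""
    return -(-block // stripe_blocks) * stripe_blocks

# A 64K write at the start of the partition, for both starting offsets.
print("start 64: ", stripes_touched(64, 0, 128), "stripes")
print("start 128:", stripes_touched(128, 0, 128), "stripes")
print("aligned start for 64:", align_up(64))
```

With the partition starting at block 64 the write spans two stripes; starting at 128 it fits in exactly one, which is the full-stripe write case that avoids the read-modify-write.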