Re: WAPL/RAIDframe performance problems

To: Edgar Fuß <ef%math.uni-bonn.de@localhost>, tech-kern%NetBSD.org@localhost, tech-perform%NetBSD.org@localhost
Subject: Re: WAPL/RAIDframe performance problems
From: Brian Buhrow <buhrow%nfbcal.org@localhost>
Date: Sat, 10 Nov 2012 10:29:44 -0800
        Hello.  Your tests and data have inspired me to think about and play
with different spsu sizes under raidframe because I too am experiencing
some possible performance issues with the mpt(4) driver.  My testing has
led to some questions and, possibly in your case, Edgar, some avenues that
you can pursue.  It may be that I'm behind you in this process and that
you've already thought of some of the issues I lay out below.  I don't have
all the answers, but perhaps these ramblings will help move you forward.

1.  I've discovered that the raid cards the mpt(4) driver controls are
complicated beasts with very large firmwares inside them which do a lot of
magic.  I think it's possible, even if you have the raid card in jbod mode,
that you could be writing with sub-optimal block sizes to the raid card
itself, as opposed to raidframe at the OS level.  For example, I have two
production boxes with large amounts of disk on raidframe.  One attaches the
SATA disks to 4-port Promise cards controlled by the pdcsata(4) driver.  The
other attaches the disks to 3Ware Escalade cards controlled by the twa(4)
driver.  The Promise cards give roughly twice the throughput of the 3Ware
cards for the same work load and configuration.  Is it possible for you to
set up a test with raid5 using a different sata or sas interface card?
        Also, I have a machine I'm qualifying right now using the mpt(4)
driver with the dual-port LSI Fusion 1030 card.  I've found that under high
work loads
this card just stops generating interrupts for no apparent reason.  I
wonder if you're suffering from a similar, but not quite so fatal, kind of
issue with your raid controller?  (just to reinforce how complex these cards
really are.)

2.  As I was playing with different spsu sizes, it occurred to me that I
still might not be getting optimal performance through raidframe because,
while the spsu size was right, the sector boundaries which define the
stripe might not line up with the blocks being read from and written to
on the filesystem itself.  In other words, when a stripe size and
filesystem block size are calculated, I think it's important to count up
exactly where the active blocks that will be in use during the life of the
filesystem will land in your raid set and adjust the partition sizes
accordingly.  For example, /usr/include/ufs/ffs/fs.h suggests that the
superblock could be in one of 4 different places on your partition,
depending on what size your disk is and what version of superblock you're
using.  What I think this means is that if you want to arrange for
optimal performance of a filesystem on a raid5 set, you need to ensure that
the superblock for that filesystem starts at the beginning of a
stripe for the raid5 set.  If you're using a BSD disklabel and an ffsV1
superblock, the superblock starts 8192 bytes into the partition.  If it's
an ffsV2 superblock, then it could be at 65536 bytes from the beginning of
the partition.  There is even a possibility the superblock could be at 256
kbytes from the beginning of a partition, though I'm not certain I know
when that's true.  I want to reiterate that I'm no expert here, but unless
I'm not understanding things at all, I think this means that starting the
first data partition at sector 0 of the raid set is almost certain to
guarantee that you're going to get rmw write cycles when writing to a
filesystem placed in this manner on a raid set, even if the spsu is correct.
This is why, I think, you're not seeing any appreciable improvement in write
speed on your raid 5 set and why raid5 is so much slower than raid1.  (raid5
will always be slower than raid1, but how much slower is, I think, what's at
issue here.)
        I've not had a chance to play with all my ideas here in my testing,
so I can't say for sure if my theory makes a huge difference in practice, but
it's something I plan to try and it may be worth investigating in your
case, especially since it severely impacts your work efficiency.  And, as I
said earlier, perhaps you've thought of all this and have made the necessary
adjustments to your partition tables and I'm preaching to the choir.
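
A rough sketch of the arithmetic behind point 2, for anyone who wants to
check their own layout.  It assumes 512-byte sectors and that one full raid5
data stripe is spsu x (components - 1) sectors; the function names are made
up for illustration, and whether superblock alignment really matters in
practice is exactly the untested part of the theory above.

```shell
#!/bin/sh
# Sketch only: checks whether candidate FFS superblock offsets fall on a
# raid5 stripe boundary.  Assumes 512-byte sectors; names are illustrative.

SECTOR_BYTES=512

# stripe_bytes SPSU NCOMPONENTS
# Data bytes in one full raid5 stripe (one component's worth holds parity).
stripe_bytes() {
    echo $(( $1 * ($2 - 1) * $SECTOR_BYTES ))
}

# offset_in_stripe PART_START_SECTORS SB_OFFSET_BYTES STRIPE_BYTES
# Byte position of the superblock within its stripe; 0 means stripe-aligned.
offset_in_stripe() {
    echo $(( ($1 * $SECTOR_BYTES + $2) % $3 ))
}

# Example: 5 components, spsu 8 (a 16k data stripe), partition at sector 0.
stripe=$(stripe_bytes 8 5)
for sb in 8192 65536 262144; do
    echo "superblock at $sb: $(offset_in_stripe 0 "$sb" "$stripe") bytes into its stripe"
done
```

With these example values the 8192-byte offset lands 8192 bytes into a
stripe, while the 65536 and 262144 offsets are stripe-aligned.  Passing 16
as the partition start instead of 0 aligns the 8192-byte case and misaligns
the other two.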


        I hope this helps you some, and gives you some paths forward.  I know
how frustrating it can be to try and figure out why something isn't working
as expected.

-Brian
On Nov 10,  5:46pm, Edgar Fuß wrote:
} Subject: WAPL/RAIDframe performance problems
} So apart from the WAPL panic (which I'm currently unable to reproduce), I
} seem to be facing two problems:
} 
} 1. A certain svn update command is ridiculously slow on my to-be file server.
} 2. During the svn update, the machine partially locks up and fails to respond 
}    to NFS requests.
} 
} There is little I feel I can analyze about (2). I posted a bunch of crash(8)
} traces, but that doesn't seem to help.
} 
} There seem to be two routes I can pursue analyzing (1):
} A. Given the svn update is fast on a single-disc setup and slow on the file
}    server having a RAID, larger blocks and what else, find what the
}    significant difference between the two setups is.
} B. Given the only unusual thing that svn update command does is creating a 
}    bunch of .lock files, a lot of stat()'s and then unlinking the .lock files,
}    find some simple commands to enable others to reproduce the problem.
} 
} Regarding (B), a simple command somewhat mimicking the troublesome svn update 
} seems to be (after mkdir'ing the 3000 dirs)
}       time sh -c 'for i in $(seq 1 3000); do touch $i/x; done; for i in $(seq 1 3000); do rm $i/x; done'
} 
} Regarding (A), there seems to be no single difference explaining the whole 
} performance degradation, so I tried to test intermediate steps.
} 
} We start with a single SATA disc on a recent 5.1 system.
} With WAPL, the svn update takes 5s to 7s (depending on the FFS version and
} the fsbsize) while the touch/rm dance takes 4s.
} Disabling WAPL makes the svn update take 5s (i.e. better or no worse than
} with WAPL enabled), while the touch/rm slows down to almost 14s.
} Enabling soft updates, the svn update finishes in 1,25s, the touch/rm in 4s.
} Write speed (dd) on the file system is 95MB/s.
} - So the initial data point is 5s for svn and 4s for the substitute for a 
}   95MB/s file system write throughput.
} - We also note that softdep outperforms WAPL by a factor of 4 for the svn 
}   command and plain FFS performs no worse than WAPL.
} 
} We now move to a plain mpt(4) 7200rpm SAS disc (HUS723030ALS640, if anyone 
} cares) on the 6.0 system.
} Without WAPL, the svn update takes (on different FFS versions and fsbsizes) 
} 5s to 7s. The touch/rm takes 9,5 to 19s.
} With WAPL, svn takes 9s to 13s and touch/rm 8 to 9,5s.
} No softdeps on 6.0 to try.
} Write speed to fs is 85MB/s.
} So we have:
} - without WAPL, both "the real thing" and the substitute are roughly as fast 
}   as on the SATA system (which has slightly higher fs write throughput).
} - with WAPL, both commands are significantly slower than on the SATA box.
} 
} Now to a two-component Level 1 RAID on two of these discs. We chose an SpSU 
} value of 32 and a matching fsbsize of 16k.
} The svn update takes 13s with WAPL and just under 6s without.
} The touch..rm test takes 22s with WAPL and 19s without.
} Write speed is at 18MB/s, read at 80MB/s
} So on the RAID 1:
} - Without WAPL, things are roughly as fast as on the plain disc.
} - With WAPL, both svn and the test are slower than without (with the real
}   thing worse than the substitute)!
} - Read speed is as expected, while writing is four times slower than I would 
}   expect given the optimal (for writing) fsbsize equals stripe size relation.
} 
} Next a five-component Level 5 RAID. Again, an SpSU of 8 matches the fsbsize
} of 16k.
} Here, the update takes 56s with WAPL and just 31 without.
} The touch..rm test takes 1:50 with WAPL and 0:56 without.
} Write speed on the fs is 25MB/s, read speed 190MB/s
} So on the RAID 5:
} - Both the "real thing" and the substitute are significantly (about a factor
}   of five) slower than on the RAID 1 although the RAID's stripe size matches
}   the file system block size and we should have no RMW cycles.
} - OTOH, write and read speeds are faster than on RAID 1; still, writing is 
}   much, much slower than reading (again, with an SpSU optimized for writing).
} - Enabling WAPL _slows down_ things by a factor of two.
} 
} Simultaneously quadrupling both SpSU and fsbsize (to 32 and 64k) doesn't
} change much on that.
} 
} But last, on a Level 5 RAID with 128SpSU and 64k fsbsize (i.e., one file
} system block per stripe unit, not per stripe):
} The svn update takes 36s without WAPL and just 1,7s with WAPL, but seven 
} seconds later, the discs are 100% busy for another 33s. So, in fact, it takes 
} 38s until the operation really finishes.
} 
} 
} Now, that's a lot of data (in fact, about one week of analysis).
} Can anyone make sense out of it? Especially:
} - Why is writing to the RAID so slow even with no RMWs?
} - Why does WAPL slow down things?
} 
} 
} If it wasn't for softdep-related panics I suffered on the old (active, 4.0) 
} file server no-one was able to fix, I would simply cry I wanted my softdeps 
} back. As it is, I need help.
>-- End of excerpt from Edgar Fuß
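
For anyone wanting to reproduce the touch/rm test quoted above on their own
hardware, it can be wrapped in a small script along these lines.  This is
only a sketch: the function names, the argument handling and the
whole-second timing via date +%s are assumptions added here, not part of the
original command.  Unlike the original, it avoids seq and times only the
touch/rm phase, not the mkdir setup.

```shell
#!/bin/sh
# Sketch of the quoted touch/rm test.  Usage (hypothetical script name):
#     sh touchrm.sh /path/on/filesystem/under/test 3000

# make_dirs DIR COUNT
# Creates the numbered subdirectories DIR/1 .. DIR/COUNT.
make_dirs() {
    n=1
    while [ "$n" -le "$2" ]; do
        mkdir -p "$1/$n" || return 1
        n=$((n + 1))
    done
}

# touch_rm_dance DIR COUNT
# Touches one file in every subdirectory, then removes them all again,
# mimicking the create/unlink pattern of the .lock files.
touch_rm_dance() {
    n=1
    while [ "$n" -le "$2" ]; do
        touch "$1/$n/x"
        n=$((n + 1))
    done
    n=1
    while [ "$n" -le "$2" ]; do
        rm "$1/$n/x"
        n=$((n + 1))
    done
}

# Only run when a directory and a count are given on the command line.
# The numbered directories are left in place afterwards, as in the original.
if [ "$#" -eq 2 ]; then
    make_dirs "$1" "$2" || exit 1
    start=$(date +%s)
    touch_rm_dance "$1" "$2"
    end=$(date +%s)
    echo "touch/rm over $2 dirs took $((end - start))s"
fi
```

The directory argument should live on the file system being measured, so
that the metadata updates actually hit the RAID set in question.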