Subject: Re: raidframe consumes cpu like a terminally addicted
To: Robert Elz <kre@munnari.OZ.AU>
From: Greg Oster <oster@cs.usask.ca>
List: netbsd-users
Date: 04/30/2001 09:31:49
Robert Elz writes:
>     Date:        Mon, 30 Apr 2001 00:46:02 +0200
>     From:        Matthias Buelow <mkb@mukappabeta.de>
>     Message-ID:  <20010430004602.A1934@altair.mayn.de>
> 
> Greg, I'm not sure if you read netbsd-users,

Ya.. I try to :)

> so I have added you
> explicitly, I think you are the one person who knows enough about
> raidframe to answer this quickly...
> 
>   | and that's the label of the raid5 set on top of them:
>   | 
>   | # /dev/rraid1d:
> 
>   | bytes/sector: 512
>   | sectors/track: 256
>   | tracks/cylinder: 1
>   | sectors/cylinder: 256
>   | cylinders: 139970
> 
> Yes, that's certainly not a good layout.   If you don't have the kernel
> patch mentioned on the list, then allocating new blocks is (sometimes)
> likely to be very slow (CPU intensive).

This is the problem that Herb mentioned...

> You'll also most likely find that you're wasting more filesys space in
> overheads than you would really like - I'll bet that when you newfs'd
> that filesys it was printing "alternate superblock numbers" until you
> thought they would never stop...   That's because this layout will
> cause way too many cylinder groups (with all their headers, etc) with
> way too few blocks in them.
> 
> But I doubt this is the cause of your ls -l problems.
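
(As an aside on the cylinder group overhead: running dumpfs on the filesystem
in question, e.g. 'dumpfs /dev/rraid1e | head' (that partition name is just a
guess; use whichever raid partition the filesystem actually lives on), should
show an "ncg" field near the top, which is the number of cylinder groups.
With 256-sector "cylinders" and ~140000 of them I'd guess that number is up
in the thousands, i.e. a lot of cylinder group headers to carry around.)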
> 
>   | During the test (the ls -l on a directory with ~1000 entries) I have
>   | looked on disk i/o with systat iostat and there was almost nothing
>   | (most of it was in the buffer cache anyways) so insufficient bandwidth
>   | shouldn't be the problem in that case, imho.
> 
> I wonder if perhaps raid5 is requiring that the parity be recomputed
> (checked) every time the blocks are accessed.   

No, it doesn't.

> Most likely the 1000
> entries will be overflowing the vnode cache, meaning each stat() will
> require the inode to be fetched from the buffer cache (and of course,
> certainly the first time they're referenced).   1000 inodes from one directory
> isn't much of a buffer cache load though, so quite likely all the inode
> blocks are in memory - hence little or no actual I/O.
> 
> But if doing a read from the buffer cache requires the raid code to
> validate the raid5 parity each time, then there's likely to be a lot of
> CPU overhead involved there (1000 times 3 times 2K word reads (ie, 6M RAM
> accesses), plus the computations).

and if it was validating the parity each time, yes, it would kill things...

> However, I am speculating, Greg will know if anything like that might
> possibly be causing high CPU usage while doing an "ls -l" on a raid5
> directory containing lots of files (which doesn't happen doing a similar
> access on a non-raid filesys).

A couple of other things:

Matthias Buelow writes (in a previous message):
> START layout
> # sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
> 32 1 1 5

That's only a 16K stripe unit per component in this case (32 sectors * 512
bytes), which is probably too small for the best performance... 64 sectors
(32K per component) may perform better.
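
If you do rebuild the set with a bigger stripe unit, the only change in the
config file is that line in the layout section, something like:

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level
64 1 1 5

(64 sectors * 512 bytes = 32K per stripe unit on each component.  This is
just a sketch of the one-line change; note that changing sectPerSU changes
the on-disk layout, so you'd need to backup/reconfigure/newfs/restore to
try it.)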

Also: do a 'time ls -l > /dev/null' on the offending directory, and a 
'time ls -l > /dev/null' on the "single disk of the same type".  I'm 
interested in seeing what those say...  For comparison, here's what I get on
a RAID 5 set here:

merlin# cat * > /dev/null
merlin# time ls -l > /dev/null
0.4u 0.4s 0:01.13 76.9% 0+0k 53+4io 41pf+0w
merlin# time ls -l > /dev/null
0.4u 0.4s 0:00.94 93.6% 0+0k 0+4io 0pf+0w
merlin# time ls -l > /dev/null
0.4u 0.4s 0:00.94 94.6% 0+0k 0+0io 0pf+0w
merlin# ls -1 | wc
   21613   21613  231666
merlin# du -sk .
252408  .
merlin# 

(RAID 5 over 3 IDE disks, a stripe width of 64 sectors, and a disklabel that
looks (in part) like:
bytes/sector: 512
sectors/track: 128
tracks/cylinder: 32
sectors/cylinder: 4096
cylinders: 43976
i386 box running 1.5.1_BETA as of Apr. 7)

Later...

Greg Oster