Subject: Re: SCSI performance degradation figures
To: Toru Nishimura <nisimura@itc.aist-nara.ac.jp>
From: Chuck Silvers <chuq@chuq.com>
List: port-pmax
Date: 08/25/2001 01:56:12
ok, first let's compare hardware:
my test machine is a ds5000/200 with 120MB of RAM.
this model does not have the IOASIC.
I used a 120MB file size for iozone so that it matches the amount of RAM.


here are some results for 1.5 (with the extra iozone spew removed):

2 pmax:~ # iozone -s 120m -f /mnt/file -i 0 -i 1
              KB  reclen   write rewrite    read    reread
          122880       4    3069     800     2619     2631
4.242u 93.663s 4:50.12 33.7%    0+0k 19289+30779io 22pf+0w
5 pmax:~ # time dd if=/dev/rsd0e of=/dev/null bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 50 secs (2516582 bytes/sec)
0.093u 6.055s 0:49.96 12.2%     0+0k 7+1io 12pf+0w
6 pmax:~ # time dd if=/dev/zero of=/dev/rsd0e bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 48 secs (2621440 bytes/sec)
0.189u 11.124s 0:48.83 23.1%    0+0k 0+2io 0pf+0w
14 pmax:~ # time pt -r -c 30720 -s 4096 /mnt/file
125829120 bytes transferred in 46.714 secs (2693557 bytes/sec)
0.319u 18.526s 0:46.93 40.1%    0+0k 1945+1io 0pf+0w


here are the same tests on -current with my ubc-perf stuff applied:
5 pmax:~ # iozone -s 120m -f /mnt/file -i 0 -i 1
              KB  reclen   write rewrite    read    reread
          122880       4    2140     975     2033     2053
5.731u 112.940s 5:08.42 38.4%   0+0k 12+59io 35pf+0w
8 pmax:~ # time dd if=/dev/rsd0e of=/dev/null bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 51 secs (2467237 bytes/sec)
0.561u 5.794s 0:50.55 12.5%     0+0k 0+1io 5pf+0w
9 pmax:~ # time dd if=/dev/zero of=/dev/rsd0e bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 68 secs (1850428 bytes/sec)
0.268u 13.013s 1:07.76 19.5%    0+0k 0+2io 0pf+0w
1 pmax:~ # time pt -r -c 30720 -s 4096 /mnt/file
125829120 bytes transferred in 58.113 secs (2165238 bytes/sec)
0.504u 14.682s 0:58.51 25.9%    0+0k 11+1io 11pf+0w

2 pmax:~ # time pt -r -c 120 /mnt/file
125829120 bytes transferred in 54.554 secs (2306476 bytes/sec)
0.101u 10.116s 0:55.04 18.5%    0+0k 9+1io 11pf+0w




("pt" is a simple program I wrote that just does reads or writes of
various sizes, very similar to dd except that it doesn't need to read
from /dev/zero in order to generate writes.  I use it instead of dd usually.
I can supply source if anyone cares.)
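
in the meantime, here's a minimal sketch of what pt does (the real
program has a few more options, but this captures the idea; the 1MB
default record size and the exact option handling below are just meant
to match the runs above):

/*
 * minimal pt-like tool: read or write "count" records of "size" bytes
 * to/from a file and time the transfer.  writes come from a
 * preallocated zeroed buffer, so there's no need to read /dev/zero.
 *
 * usage: pt [-r | -w] [-c count] [-s size] file
 */
#include <sys/time.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	struct timeval t0, t1;
	long long total = 0;
	long count = 120, size = 1024 * 1024, i;
	char *buf;
	ssize_t n;
	double secs;
	int ch, fd, dowrite = 0;

	while ((ch = getopt(argc, argv, "rwc:s:")) != -1) {
		switch (ch) {
		case 'r':
			dowrite = 0;
			break;
		case 'w':
			dowrite = 1;
			break;
		case 'c':
			count = atol(optarg);
			break;
		case 's':
			size = atol(optarg);
			break;
		default:
			errx(1, "usage: pt [-r | -w] [-c count] [-s size] file");
		}
	}
	if (argc - optind != 1)
		errx(1, "usage: pt [-r | -w] [-c count] [-s size] file");

	if ((buf = calloc(1, size)) == NULL)
		err(1, "calloc");
	fd = open(argv[optind], dowrite ? (O_WRONLY | O_CREAT) : O_RDONLY,
	    0644);
	if (fd == -1)
		err(1, "%s", argv[optind]);

	gettimeofday(&t0, NULL);
	for (i = 0; i < count; i++) {
		n = dowrite ? write(fd, buf, size) : read(fd, buf, size);
		if (n < 0)
			err(1, dowrite ? "write" : "read");
		if (n == 0)
			break;			/* EOF on read */
		total += n;
	}
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%lld bytes transferred in %.3f secs (%.0f bytes/sec)\n",
	    total, secs, total / secs);
	return 0;
}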


as you can see, the scsi write throughput of -current is greatly
reduced on my hardware compared to 1.5 (2621KB/s in 1.5 vs. 1850KB/s in
-current).  taking that into account, writes via the filesystem are, as
you would expect, about the same percentage slower in -current.
I don't know why you're seeing such a huge drop in filesystem write
throughput on your hardware.  I doubt it's a pmap issue, though.
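
(for reference, from the numbers above: raw writes go from 2621440 to
1850428 bytes/sec, i.e. about 29% slower, and the iozone write figures
go from 3069 to 2140, i.e. about 30% slower, so for writes the extra
slowdown really does seem to be at the scsi level rather than in the
filesystem code.)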

reads are a different story, however.  I see about the same scsi read
throughput, but reads via the filesystem are 28% slower in -current.
increasing the application buffer size from 4KB to 1MB helps somewhat,
bringing that down to 17% slower, but that's still pretty bad.

looking at some profile data for the read case, we see:

1.5:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 49.26     26.30    26.30                             memmove
 14.62     34.10     7.80     1959  3984.02  3997.57  mips_idle
 12.89     40.98     6.88                             htonl
  3.54     42.87     1.89                             _mcount
  3.44     44.71     1.84   498612     3.68     3.68  _splset
  2.30     45.93     1.23    61602    19.91   158.31  syscall
...
  0.01     53.29     0.01     1968     3.97   106.77  spec_strategy



-current + ubc-perf:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 33.65     24.26    24.26     2010 12068.56 12068.56  mips_idle
 27.74     44.26    20.00                             memcpy
 12.48     53.25     9.00                             __mcount
  3.47     55.76     2.50                             _mcount
  2.10     57.27     1.52                             _splset_noprof
  1.51     58.36     1.09    17583    61.76   165.47  genfs_getpages
...
  0.01     72.02     0.01     2036     3.84   109.01  spec_strategy




we're doing about the same number of disk I/Os in both cases
(spec_strategy is called 1968 times in 1.5 and 2036 times in -current).

so it appears that there's a lot more idle time for this operation in
-current than there was in 1.5.  this doesn't make any sense to me.
it occurred to me that the pt program might be getting paged out and
read back in, which would slow things down, but the results are the
same even when pt uses mlockall() to keep itself in memory.
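
(the mlockall() version just wires the whole process at startup, along
these lines:)

#include <sys/mman.h>

#include <err.h>

/*
 * wire the entire address space, current and future mappings, so the
 * benchmark program itself can't be paged out during the run.
 * (sketch only; call this early in main().)
 */
static void
lock_self(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
		err(1, "mlockall");
}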

the only other thing I can think of is that maybe we still have some
kind of interrupt latency problem, but that doesn't seem too likely.

-Chuck


On Sat, Aug 25, 2001 at 02:43:19PM +0900, Toru Nishimura wrote:
> > could you post the results for running iozone on a raw device too?
> > on my ds5000/200, raw device writes are 50% slower since the integration
> > of thorpej_scsipi vs. 1.5 (with the MI SCSI in both cases).
> 
> I don't have a good way to run iozone on a raw device.  The previous
> raw figures were obtained with dd bs=32k on rsd0.
> 
> > also, could you give some more info about your setup, such as how much
> > memory in your machine and the parameters you use to run iozone?
> 
> Sure.  A 20MHz R3000A (corrected), 40MB MAXINE.  40MB of write/read
> to flood the available memory.  The NCR is clocked at 25MHz and has
> IOASIC DMA assist.
> 
> I happen to have both a PMAZ-A and a PMAZB-A hooked up to the same
> MAXINE.  The three TURBOchannel NCRs have internals that differ from
> each other.
> 
> Now, here are the results for the RZ28, measured with a 40MB iozone
> run.
> 
> (iozone)	K48		X1		 CHS0
> PMAZ-A
> 	W sec	41.98		72.87		69.24
> 	R sec	38.93		46.16		43.16
> 	W B/s	  999,119	 575,587	  605,763
> 	R B/s	1,077,396	 908,644	  971,803
> 
> PMAZB-A
> 	W sec	37.31		74.11		68.37
> 	R sec	30.13		37.36		33.72
> 	W B/s	1,124,176	  565,956	  613,471
> 	R B/s	1,392,069	1,122,672	1,243,862
> 
> Raw write/read performance of the RZ28, measured with dd bs=32k.
> These look consistent between K48 and CHS0.
> 
> (dd)	PMAZ-A 		PMAZB-A
> W B/s	1,104,000	1,332,000
> R B/s	1,170,000	2,607,000
> 
> I ran iozone measurements on a DEC3000/600: an RZ28 disk with a
> 175MHz 21064 processor, 8-bit fast SCSI transfers, a 1.5X kernel, and
> a 64MB iozone run to flood memory.
> 
> (iozone)	3000/600, PMAZC-A equivalent, 1.5X
> W B/s		4,142,000
> R B/s		4,722,000
> 
> The figures are quite impressive and suggestive:
> - Same OS, same filesystem, same SCSI automata, same SCSI driver.
> - Same SCSI adaptor with parametric differences; 8-bit with a 12.5MHz
>   TURBOchannel interconnect vs. 8-bit fast with 25MHz.
> - Mostly the same machine architecture; enhancements for larger
>   memory bandwidth.
> - Differs in processor, R3000 vs. 21064, and in cache capacities.
> 
> -nisimura