Subject: Re: SCSI performance degradation figures
To: Toru Nishimura <nisimura@itc.aist-nara.ac.jp>
From: Chuck Silvers <chuq@chuq.com>
List: port-pmax
Date: 08/25/2001 01:56:12
ok, first let's compare hardware:
my test machine is a ds5000/200 with 120MB of RAM.
this model does not have the IOASIC.
I used a 120MB file size for iozone so that it's the size of RAM.
here are some results for 1.5 (with the extra iozone spew removed):
2 pmax:~ # iozone -s 120m -f /mnt/file -i 0 -i 1
KB reclen write rewrite read reread
122880 4 3069 800 2619 2631
4.242u 93.663s 4:50.12 33.7% 0+0k 19289+30779io 22pf+0w
5 pmax:~ # time dd if=/dev/rsd0e of=/dev/null bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 50 secs (2516582 bytes/sec)
0.093u 6.055s 0:49.96 12.2% 0+0k 7+1io 12pf+0w
6 pmax:~ # time dd if=/dev/zero of=/dev/rsd0e bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 48 secs (2621440 bytes/sec)
0.189u 11.124s 0:48.83 23.1% 0+0k 0+2io 0pf+0w
14 pmax:~ # time pt -r -c 30720 -s 4096 /mnt/file
125829120 bytes transferred in 46.714 secs (2693557 bytes/sec)
0.319u 18.526s 0:46.93 40.1% 0+0k 1945+1io 0pf+0w
here are the same tests on -current with my ubc-perf stuff applied:
5 pmax:~ # iozone -s 120m -f /mnt/file -i 0 -i 1
KB reclen write rewrite read reread
122880 4 2140 975 2033 2053
5.731u 112.940s 5:08.42 38.4% 0+0k 12+59io 35pf+0w
8 pmax:~ # time dd if=/dev/rsd0e of=/dev/null bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 51 secs (2467237 bytes/sec)
0.561u 5.794s 0:50.55 12.5% 0+0k 0+1io 5pf+0w
9 pmax:~ # time dd if=/dev/zero of=/dev/rsd0e bs=32k count=3840
3840+0 records in
3840+0 records out
125829120 bytes transferred in 68 secs (1850428 bytes/sec)
0.268u 13.013s 1:07.76 19.5% 0+0k 0+2io 0pf+0w
1 pmax:~ # time pt -r -c 30720 -s 4096 /mnt/file
125829120 bytes transferred in 58.113 secs (2165238 bytes/sec)
0.504u 14.682s 0:58.51 25.9% 0+0k 11+1io 11pf+0w
2 pmax:~ # time pt -r -c 120 /mnt/file
125829120 bytes transferred in 54.554 secs (2306476 bytes/sec)
0.101u 10.116s 0:55.04 18.5% 0+0k 9+1io 11pf+0w
("pt" is a simple program I wrote that just does reads or writes of
various sizes, very similar to dd except that it doesn't need to read
from /dev/zero in order to generate writes. I use it instead of dd usually.
I can supply source if anyone cares.)
as you can see, the scsi write throughput of -current is greatly
reduced on my hardware compared to 1.5 (2621KB/s in 1.5 vs. 1850KB/s in
-current). taking that into account, writes via the filesystem are
as you would expect, about the same percentage slower in -current.
I don't know why you're seeing such a huge drop in filesystem write
throughput on your hardware. I doubt it's a pmap issue, though.
reads are a different story, however. I see about the same scsi read
throughput, but reads via the filesystem are 28% slower in -current.
increasing the application buffer size from 4KB to 1MB is somewhat better,
only 17% slower. that's still pretty bad.
looking at some profile data for the read case, we see:
1.5:
% cumulative self self total
time seconds seconds calls us/call us/call name
49.26 26.30 26.30 memmove
14.62 34.10 7.80 1959 3984.02 3997.57 mips_idle
12.89 40.98 6.88 htonl
3.54 42.87 1.89 _mcount
3.44 44.71 1.84 498612 3.68 3.68 _splset
2.30 45.93 1.23 61602 19.91 158.31 syscall
...
0.01 53.29 0.01 1968 3.97 106.77 spec_strategy
-current + ubc-perf:
% cumulative self self total
time seconds seconds calls us/call us/call name
33.65 24.26 24.26 2010 12068.56 12068.56 mips_idle
27.74 44.26 20.00 memcpy
12.48 53.25 9.00 __mcount
3.47 55.76 2.50 _mcount
2.10 57.27 1.52 _splset_noprof
1.51 58.36 1.09 17583 61.76 165.47 genfs_getpages
...
0.01 72.02 0.01 2036 3.84 109.01 spec_strategy
we're doing about the same number of
so it appears that there's a lot more idle time for this operation in
-current than there was in 1.5. this doesn't make any sense to me.
it occured to me that the pt program might be being paged out and
read back in, which would slow things down, but the results are the same
even when pt uses mlockall() to keep itself in memory.
the only other thing I can think of is maybe we still have some kind of
interrupt latency problem, but that doesn't seem too likely.
-Chuck
On Sat, Aug 25, 2001 at 02:43:19PM +0900, Toru Nishimura wrote:
> > could you post the results for running iozone on a raw device too?
> > on my ds5000/200, raw device writes are 50% slower since the integration
> > of thorpej_scsipi vs. 1.5 (with the MI SCSI in both cases).
>
> I have no good idea to run iozone on raw device. The previous raw
> figures are obtained by dd bs=32k on rsd0.
>
> > also, could you give some more info about your setup, such as how much
> > memory in your machine and the parameters you use to run iozone?
>
> Sure. 20MHz R3000A <- corrected, 40MB MAXINE. 40MB write/read to
> flood available memory. The NCR is clocked at 25MHz has IOASIC
> DMA assist.
>
> I happen to hook PMAZ-A and PMAZB-A on the same MAXINE. The
> three TURBOchannel NCRes have internals different from each other.
>
> Now, here is the result of RZ28, measured with 40MB of iozone.
>
> (iozone) K48 X1 CHS0
> PMAZ-A
> W sec 41.98 72.87 69.24
> R sec 38.93 46.16 43.16
> W B/s 999,119 575,587 605,763
> R B/s 1,077,396 908,644 971,803
>
> PMAZB-A
> W sec 37.31 74.11 68.37
> R sec 30.13 37.36 33.72
> W B/s 1,124,176 565,956 613,471
> R B/s 1,392,069 1,122,672 1,243,862
>
> Raw write/read performance of RZ28, measured with dd bs=32k. These
> look consistent between K48 and CHS0.
>
> (dd) PMAZ-A PMAZB-A
> W B/s 1,104,000 1,332,000
> R B/s 1,170,000 2,607,000
>
> I ran iozone measurements on DEC3000/600. RZ28 disk with 175MHz
> 21064 processor, 8bit fast SCSI transfer, 1.5X kernel, 64MB iozone
> to flood memory.
>
> (iozone) 3000/600, PMAZC-A equivalent, 1.5X
> W B/s 4,142,000
> R B/s 4,722,000
>
> The figure is quite impressive and implicative.
> - Same OS, same filesysystem, same SCSI automata, same SCSI driver.
> - Same SCSI adaptor with parametric differences; 8bit with 12.5MHz
> TURBOchannel interconnect while 8bit-fast with 25MHz.
> - Mostly same machine architecture; enhancements for larger memory
> bandwidth.
> - Differs in processor, R3000 v.s. 21064, and cache capacities.
>
> -nisimura