port-alpha: Re: PII vs 21164

Subject: Re: PII vs 21164
To: None <javi@cse.ucsc.edu>
From: Christian von Kleist <cvk@zybx.com>
List: port-alpha
Date: 05/16/2003 10:22:23
     Wow, thanks for the info!  That clarifies a few things for me:  for
example, I didn't know that Alpha instructions were just 32 bits!  I
guess that's what happens when a CS junior stretches his knowledge a
little thin.  :)

>>      Finally, there's a big difference in code size between the 21164
>> and the PII.  The PII has a huge collection of CISC instructions that
>> result in small assembly code size because a smaller number of
>> assembly instructions are required per line of source code.  Small
>> assembly code size means less memory traffic during compilation and
>> assembly and less disk traffic when the resultant object code is
>> written to disk.  In contrast the 21164 has a very small RISC
>> instruction set that requires more machine code instructions per line
>> of source code.  Also, the 21164 creates a 64-bit instruction from
>> each of the assembly instructions.
>
> You lost me here, the assembly instruction is what the alpha actually
> uses... it does not generate an instruction out of those assembly
> instructions. The instruction size for the Alpha is 32bit actually, the
> native data path however (data size) is 64bit -common misconception-

     I don't know much about the instruction set of the 21164 and similar
Alphas, so this example is in MIPS assembly (I took a course on MIPS
assembly last semester):

lw $s0, 0($s1)     # load a word from memory
addi $s0, $s0, 5   # $s0 += 5
sw $s0, 0($s1)     # store $s0 back to memory
beq $s0, $0, TEST  # if ($s0 == 0) branch to label TEST
jr $ra             # return to the calling procedure

     A code fragment like this one might exist in memory after GCC has
compiled some source file (minus the comments).  Now the assembler
has to create machine code for it, something like this:

01100101100011010100111001010101
01001010101010101011000000000101
10001101110010110100010111010101
01110000101001010110100001010101
00111010101101001001011001110001

     Each of these corresponds, of course, to one of the assembly
instructions (although the semester is over, so I just typed a bunch
of 1's and 0's, not anything valid!!).  The assembly instructions
have to be moved in and out of memory piece by piece while the
assembler processes the assembly code, and then each of the machine
code instructions has to be written to memory and finally to a .o
file on disk.  My point was that since the Alpha needs a larger
instruction count for the same source, there is much more memory
transfer during assembly (it would be even worse if instructions were
64 bits) and more transfer to disk.  That would seem to make the
PII's memory transfer requirements smaller during compilation.
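
     For anyone curious what valid encodings would look like, here is a
minimal Python sketch that packs the five MIPS instructions above into
32-bit words.  It assumes the standard MIPS32 I-type and R-type field
layouts and register numbers.  The branch offset for TEST is a made-up
value (the fragment doesn't say where TEST lives), and the helper names
i_type, r_type and encode_fragment are my own, not part of any tool.

```python
# Register numbers from the standard MIPS32 convention.
REG = {"$0": 0, "$s0": 16, "$s1": 17, "$ra": 31}

# Hypothetical: assume TEST sits one instruction past the branch's
# successor. The original fragment does not say where TEST is.
TEST_OFFSET = 1


def i_type(opcode, rs, rt, imm):
    """Pack an I-type word: opcode(6) rs(5) rt(5) immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def r_type(rs, rt, rd, shamt, funct):
    """Pack an R-type word (opcode 0): rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_fragment(test_offset=TEST_OFFSET):
    """Return the five instructions of the fragment as 32-bit integers."""
    return [
        i_type(0x23, REG["$s1"], REG["$s0"], 0),            # lw   $s0, 0($s1)
        i_type(0x08, REG["$s0"], REG["$s0"], 5),            # addi $s0, $s0, 5
        i_type(0x2B, REG["$s1"], REG["$s0"], 0),            # sw   $s0, 0($s1)
        i_type(0x04, REG["$s0"], REG["$0"], test_offset),   # beq  $s0, $0, TEST
        r_type(REG["$ra"], 0, 0, 0, 0x08),                  # jr   $ra
    ]


if __name__ == "__main__":
    words = encode_fragment()
    for word in words:
        # Print each word as 32 binary digits plus its hex form.
        print(f"{word:032b}  0x{word:08X}")
    print(f"{len(words)} instructions, {len(words) * 4} bytes")
```

     Whatever the operands are, five instructions come out as five
4-byte words, 20 bytes in all, which is the fixed-width property the
code-size argument depends on.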

     For what it's worth, I do know that modern CISC processors are
really RISC processors in disguise: each CISC instruction is decoded
into several RISC-like micro-operations before it executes.  :)
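
     As a toy illustration of that decoding step, the Python sketch
below splits a CISC-style instruction with a memory destination into
load / operate / store steps, much like the lw / addi / sw sequence in
the MIPS fragment.  Everything in it is an assumption made for
illustration: the tuple format, the "tmp" register and the decode
function are hypothetical, and a real PII's internal micro-op format
and count differ in detail.

```python
from typing import List, Tuple

MicroOp = Tuple[str, str, str]


def decode(op: str, dst: str, src: str) -> List[MicroOp]:
    """Split one CISC-style instruction into simple register-only steps.

    A destination written as "[reg]" means "the memory word addressed by
    reg". Such an instruction becomes load, operate, store. Anything else
    is already simple and passes through as a single step.
    """
    if dst.startswith("[") and dst.endswith("]"):
        addr = dst[1:-1]
        return [
            ("load", "tmp", addr),   # tmp <- memory[addr]
            (op, "tmp", src),        # tmp <- tmp op src
            ("store", addr, "tmp"),  # memory[addr] <- tmp
        ]
    return [(op, dst, src)]


if __name__ == "__main__":
    # One memory-destination instruction expands into three steps.
    for step in decode("add", "[ebx]", "5"):
        print(step)
```

     One compact instruction in memory turns into three simple steps
inside this toy decoder, so the instruction stream stays small while the
work done is about the same as in the RISC version.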

     My knowledge is limited but I get closer to competence all the time.  :)

--
c v k @ z y b x . c o m

<quote who="Francis. Javier Mesa">
>>
>>      One thing that comes to mind is the difference in caches between
>> the 21164 and the PII.  The 21164 has 8kb of L1 cache and 96kb of L2
>> cache, and the PII has 32kb of L1 and 256- to 2048kb of L2 (yours
>> probably has 256- or 512kb).
>
> The L2 in the alpha is actually on chip, the L2 for the PII is off chip
> and runs at 1/2 the processor speed... so the smaller L2 in the Alpha is
> not as bad as it seems since even though it is smaller than the 512K
> that the PII uses for L2, it has much lower latency associated with it.
>
>> Size isn't all that matters in caches, but assuming the cache
>> hardware works about equally well in both processors the PII should
>> have a distinct advantage with its larger L2 cache, even if it's only
>> running at 66MHz (could also be 100MHz).
>
> The L2 cache for the PII is actually on the same PCB board as the
> processor and it runs at 1/2 the internal speed of the PII core.
>
>> Fortunately the 21164 can also operate an L3 cache of considerable
>> size on the motherboard.  When I put a 2mb beta cache in my PWS500a I
>> saw a near doubling of compile performance!  If you don't have a beta
>> cache module in your machine installing one might really help.
>
> As a rule of thumb RISC machines need far larger caches than CISC
> machines. In fact cache is fundamental to see the benefits of RISC. One
> of the reasons is what you pointed out, RISC machines are far more
> memory hungry than CISC when it comes to actual instruction
> requirements.
>
> CISC machines were designed to deal with limited main memories, since
> RAM was rather pricey and slow in the old days. So basically you
> wanted to do as few memory accesses as possible, both to force
> programs to be as compact as possible (ergo reducing RAM
> requirements) and to reduce the delay associated with memory accesses
> (most CISC machines were not pipelined). Those CISC instructions,
> once fetched, are actually decoded into a microcode sequence;
> basically you should think of CISC as a sort of "instruction"
> compression. RISC basically gets rid of the overhead of decoding into
> microcode; instead, what you fetch is the microcode itself. By using
> a tuned memory hierarchy (caches) and pipelining, RISC can make up
> for the increased instruction bandwidth requirements over its CISC
> counterparts.
>
>>      Finally, there's a big difference in code size between the 21164
>> and the PII.  The PII has a huge collection of CISC instructions that
>> result in small assembly code size because a smaller number of
>> assembly instructions are required per line of source code.  Small
>> assembly code size means less memory traffic during compilation and
>> assembly and less disk traffic when the resultant object code is
>> written to disk.  In contrast the 21164 has a very small RISC
>> instruction set that requires more machine code instructions per line
>> of source code.  Also, the 21164 creates a 64-bit instruction from
>> each of the assembly instructions.
>
> You lost me here, the assembly instruction is what the alpha actually
> uses... it does not generate an instruction out of those assembly
> instructions. The instruction size for the Alpha is 32bit actually, the
> native data path however (data size) is 64bit -common misconception-
>
>> CISC instructions can be considerably longer than 64 bits, but the
>> average length is probably pretty close to something like 64 bits.
>> That means the PII is moving much less information from disk to main
>> memory to cache to processor during both the compile and assembly
>> stages.
>
> The problem with CISC is that although it is true that there are lower
> instruction bandwidth requirements, the fact that instructions are not
> regular (i.e. fixed in size) actually presents a significant overhead
> (i.e. fetching a very long instruction may require several memory
> accesses, and the IM unit needs to know how many memory accesses are
> needed), whereas for RISC machines there is no such overhead since
> every instruction takes the same "amount" of fetching.
>
> Also note that the PII is actually a RISC core. The CISC instructions
> are broken down into RISC subinstructions, which behave almost like
> microcode really... but with RISC instructions as the microcode. So
> that whole decoding into smaller RISC instructionlets also presents a
> significant overhead.
>
>
>>      The 21164 is definitely as fast as the PII in integer performance.
>> (Of course it is way faster at FP performance, but aside from using
>> the FP registers as a kind of cache FP is not an issue here.)  In the
>> end I think the only explanation for the longer compile time on your
>> 21164 is that more memory transfer is being done.  That includes all
>> three levels: cache, main memory, and disk.
>
> There are 2 issues. First off, the PII's integer performance is quite
> up to par with old Alphas. Intel concentrated mostly on integer
> performance, and compiling is an integer-intensive process. Second,
> GCC is far more optimized for x86 than AXP, so there is also a
> significant difference in compiling times right there.