Re: Some more patches for GCC on NetBSD/VAX coming soon...

To: Johnny Billquist <bqt%update.uu.se@localhost>
Subject: Re: Some more patches for GCC on NetBSD/VAX coming soon...
From: Jake Hamby <jehamby420%me.com@localhost>
Date: Thu, 31 Mar 2016 21:24:04 -0700

Minor correction: the value you substitute in the top line of the test program is the CPU speed in Hz, not MHz, in case that wasn't clear from the example of 71500000.

Interestingly, if you run the "nop" part of the test on a PC (compile in 64-bit mode, or else the loop iteration count overflows a long), my low-end Xeon can do 4 nops per clock cycle. x86 has all sorts of odd optimizations around multi-byte nops, such as the 4-byte and 6-byte forms that the CPU specifically recognizes as nops; they are used for alignment padding to minimize the cost of the CPU having to skip over them. On VAX, you might be able to do the same thing if there's a particularly long instruction that is effectively a nop and executes in fewer cycles than the same number of bytes of single-byte nops would.

Regards,
Jake


> On Mar 31, 2016, at 21:13, Jake Hamby <jehamby420%me.com@localhost> wrote:
> 
> Yes, you're absolutely right. I was in an optimistic mood when I wrote "better state than ever before", and I hadn't yet discovered that there were all of those pieces of NetBSD that need to be compiled with "-O0" or the cross-compiler crashes. In the short term, GCC 5.3 is in a pretty bad state, but in the long term, I'm optimistic. :)
> 
> What annoyed me is that even as an experienced software developer, it takes weeks of reading and a lot of challenging effort to get up to speed on understanding how GCC does things with its .md files and macro expansions. The GCC internals doc is as well written as it can be, considering that it's just simply difficult to understand how it all fits together, but it also takes a lot of trial and error and experimentation and looking at generated files and running in the debugger (did you know that in GDB you can do "call debug_rtx(var)" when debugging cc1, and it calls that function in the code to pretty-print the rtx for you? very cool!). Now that I have a basic idea of what's going on, I think I can be of some use in getting these really bad code generation bugs figured out.
> 
> Speaking of older vs. newer machines and optimizing code generation for them, I'm working on a little cycle counting benchmark in C that should run on any VAX. I've been compiling with "-O2", but the loop overhead is subtracted out, so the optimization level doesn't matter too much. It's mostly a set of C macros that expand any given instruction 100 times inside a loop, then run that loop an appropriate number of times for the speed of the machine it's running on (which you have to provide). Then it calculates how many clock cycles that instruction takes.
> 
> Summary of the results for my 71.5 MHz NVAX: nop is 1 cycle, mull2 is 6 cycles, and mull3 is 3 cycles, but only when you use 3 different registers. The 8-bit multiply (mulb2 or mulb3) is 14 cycles, and the 16-bit multiply (mulw2 or mulw3) is a surprisingly slow 21 cycles. So on NVAX, 32-bit multiply is very fast, and the compiler should avoid 16-bit multiply if possible. The extended multiply-add instruction (emul) takes 37 cycles on this CPU! F_float multiply is at least as fast as 32-bit ints: 4 cycles for the 2-op version, 3 for the 3-op version. D_float (what NetBSD uses for double) takes 6 and 3.7 (oddly) cycles for 2 and 3 ops, and G_float is slightly slower at 6.1 and 4 cycles respectively.
> 
> I'll work on adding a complete set of tests for addition, subtraction, bit shifting, division, memory move, etc. but I'm curious what sort of results other people see on their systems. I'll have to try SimH next. :) Note that you need to put the clock speed for your system (in Hz) at the top before compiling, or else the number of clock cycles will be wrong. It's best to err a little on the high side: I originally used 71 MHz and got numbers like 0.99 cycles for nop, whereas with the exact speed (71.5 MHz = 286 MHz / 4) I get numbers like 1.007 cycles. The exact numbers don't matter, but it's reassuring that most of them come out so close to an integral number of clock cycles on my system. Please let me know if you have any problems running it, and I'll work on getting the rest of the test cases finished. I'd like to generate a table of numbers to use instead of the hardcoded ones in vax_rtx_costs() and vax_address_cost_1() in gcc/config/vax/vax.c.
> 
> -Jake
> 
> $ ./cyclecount
> loop overhead is 0.070569 usec
> elapsed time for nop: 10069188 usec
> # cycles at 71 MHz: 1.006919 (71008705 ips)
> elapsed time for 32-bit int multiply (2 op): 30206553 usec
> # cycles at 71 MHz: 6.041311 (11835180 ips)
> elapsed time for 32-bit int multiply (3 op, 1 reg): 30205785 usec
> # cycles at 71 MHz: 6.041157 (11835481 ips)
> elapsed time for 32-bit int multiply (3 op, 3 reg): 15150554 usec
> # cycles at 71 MHz: 3.030111 (23596497 ips)
> elapsed time for 16-bit int multiply (2 op): 21132422 usec
> # cycles at 71 MHz: 21.132422 (3383427 ips)
> elapsed time for 16-bit int multiply (3 op, 3 reg): 21142713 usec
> # cycles at 71 MHz: 21.142713 (3381780 ips)
> elapsed time for 8-bit int multiply (2 op): 42257794 usec
> # cycles at 71 MHz: 14.085931 (5075987 ips)
> elapsed time for 8-bit int multiply (3 op, 3 reg): 42256508 usec
> # cycles at 71 MHz: 14.085503 (5076141 ips)
> elapsed time for F_floating multiply (2 op): 20205073 usec
> # cycles at 71 MHz: 4.041015 (17693576 ips)
> elapsed time for F_floating multiply (3 op, 3 reg): 15161593 usec
> # cycles at 71 MHz: 3.032319 (23579317 ips)
> elapsed time for D_floating multiply (2 op): 30342012 usec
> # cycles at 71 MHz: 6.068402 (11782343 ips)
> elapsed time for D_floating multiply (3 op, 3 reg): 18620173 usec
> # cycles at 71 MHz: 3.724035 (19199607 ips)
> elapsed time for G_floating multiply (2 op): 30572040 usec
> # cycles at 71 MHz: 6.114408 (11693691 ips)
> elapsed time for G_floating multiply (3 op, 3 reg): 20297306 usec
> # cycles at 71 MHz: 4.059461 (17613175 ips)
> elapsed time for 32-bit int multiply-add (64-bit result): 37236479 usec
> # cycles at 71 MHz: 37.236479 (1920160 ips)
> csvax1$ cat cyclecount.c
> #include <stdio.h>
> #include <sys/time.h>
> 
> // replace the number below with your CPU speed in Hz (despite the variable name)
> static unsigned long mhz = (71500000L);
> 
> static struct timeval start, finish;
> 
> #define TENX(X) X X X X X X X X X X
> #define FIFTYX(X) TENX(X) TENX(X) TENX(X) TENX(X) TENX(X)
> #define HUNDREDX(X) FIFTYX(X) FIFTYX(X)
> 
> #define MAKELOOP(NAME, X, INIT) void loop_ ## NAME (unsigned long count) {\
>    unsigned long i;\
>    unsigned long long s1 = 0, s2 = 0, s3 = 0, s4 = 0;\
>    INIT ;\
>    gettimeofday(&start, 0);\
>    for (i = count; i > 0; --i) {\
>        HUNDREDX( __asm __volatile (X : "+r" (s1), "+r" (s2), "+r" (s3), "+r" (s4)); )\
>    }\
>    gettimeofday(&finish, 0);\
> }
> 
> void empty_loop(unsigned long count) {
>    unsigned long i;
>    gettimeofday(&start, 0);
>    for (i = count; i > 0; --i) {
>         __asm __volatile (""); // don't optimize out
>    }
>    gettimeofday(&finish, 0);
> }
> 
> /*
>   1.001000 (F_float) in hex is 0x20c54080
>   1.001000 (D_float) in hex is 0xe3549ba520c44080
>   1.001000 (G_float) in hex is 0xbc6a937404184010
>   1.000010 (F_float) in hex is 0x00544080
>   1.000010 (D_float) in hex is 0x238ee2d600534080
>   1.000010 (G_float) in hex is 0xc4727c5a000a4010
>   0.999999 (F_float) in hex is 0xffef407f
>   0.999999 (D_float) in hex is 0x5f4a3908ffef407f
>   0.999999 (G_float) in hex is 0x0be9e721fffd400f
> */
> 
> MAKELOOP(nop, "nop", )
> MAKELOOP(mull2, "mull2 %0, %1", )
> MAKELOOP(mull3, "mull3 %0, %0, %0", )
> MAKELOOP(mull3_2, "mull3 %0, %1, %2", )
> MAKELOOP(mulw2, "mulw2 %0, %1", )
> MAKELOOP(mulw3, "mulw3 %0, %1, %2", )
> MAKELOOP(mulb2, "mulb2 %0, %1", )
> MAKELOOP(mulb3, "mulb3 %0, %1, %2", )
> MAKELOOP(mulf2, "mulf2 %0, %1", s1 = s2 = s3 = 0xffef407f )
> MAKELOOP(mulf3, "mulf3 %0, %1, %2", s1 = s2 = s3 = 0x20c54080ULL )
> MAKELOOP(muld2, "muld2 %0, %1", s1 = s2 = s3 = 0x5f4a3908ffef407fULL )
> MAKELOOP(muld3, "muld3 %0, %1, %2", s1 = s2 = s3 = 0xe3549ba520c44080ULL )
> MAKELOOP(mulg2, "mulg2 %0, %1", s1 = s2 = s3 = 0x0be9e721fffd400fULL )
> MAKELOOP(mulg3, "mulg3 %0, %1, %2", s1 = s2 = s3 = 0xbc6a937404184010ULL )
> MAKELOOP(emul, "emul %0, %1, %2, %3", )
> 
> static time_t loop_overhead = 0;        // overhead for 100 * base loop_count
> 
> void report_result(char *test_name, int mult, long iter) {
>    time_t elapsed = (finish.tv_sec * 1000000LL + finish.tv_usec)
>        - (start.tv_sec * 1000000LL + start.tv_usec)
>        - (loop_overhead * mult / 100);
>    double cycles = (double) elapsed / ((double) mult * 1000000.0);
>    long per_second = (long) (((double) iter * 100.0
>                                / ((double) elapsed / 1000000.0)) + 0.5);
> 
>    printf("elapsed time for %s: %ld usec\n", test_name, (long) elapsed);
>    printf("# cycles at %ld MHz: %f (%ld ips)\n",
>        (mhz / 1000000L), cycles, per_second);
> }
> 
> /* Warm up cache with short run, then report results of the real test. */
> #define RUN_TEST(FUNC, NAME, MULTIPLE) \
>    loop_ ## FUNC(500);\
>    loop_ ## FUNC(loop_count * MULTIPLE);\
>    report_result(NAME, MULTIPLE, (loop_count * MULTIPLE));
> 
> int main() {
>    /* one second per MHz per clock cycle (100 instructions per loop).
>     * For faster instructions, a multiplier is used.  */
>    const unsigned long loop_count = mhz / 100;
> 
>    /* calculate loop overhead to subtract from elapsed time */
>    empty_loop(loop_count * 100);
>    loop_overhead = ((finish.tv_sec * 1000000LL + finish.tv_usec) -
>        (start.tv_sec * 1000000LL + start.tv_usec));
>    printf("loop overhead is %f usec\n",
>        (double)(loop_overhead) / (double)(loop_count * 100));
> 
>    RUN_TEST(nop, "nop", 10)
>    RUN_TEST(mull2, "32-bit int multiply (2 op)", 5)
>    RUN_TEST(mull3, "32-bit int multiply (3 op, 1 reg)", 5)
>    RUN_TEST(mull3_2, "32-bit int multiply (3 op, 3 reg)", 5)
>    RUN_TEST(mulw2, "16-bit int multiply (2 op)", 1)
>    RUN_TEST(mulw3, "16-bit int multiply (3 op, 3 reg)", 1)
>    RUN_TEST(mulb2, "8-bit int multiply (2 op)", 3)
>    RUN_TEST(mulb3, "8-bit int multiply (3 op, 3 reg)", 3)
>    RUN_TEST(mulf2, "F_floating multiply (2 op)", 5)
>    RUN_TEST(mulf3, "F_floating multiply (3 op, 3 reg)", 5)
>    RUN_TEST(muld2, "D_floating multiply (2 op)", 5)
>    RUN_TEST(muld3, "D_floating multiply (3 op, 3 reg)", 5)
>    RUN_TEST(mulg2, "G_floating multiply (2 op)", 5)
>    RUN_TEST(mulg3, "G_floating multiply (3 op, 3 reg)", 5)
>    RUN_TEST(emul, "32-bit int multiply-add (64-bit result)", 1)
> 
>    return 0;
> }
> 
> 
>> On Mar 30, 2016, at 06:17, Johnny Billquist <bqt%update.uu.se@localhost> wrote:
>> 
>> On 2016-03-30 08:21, Jake Hamby wrote:
>>> I'm looking at a few remaining issues in the recent update in NetBSD-current to GCC 5.3, which overall appears to be an improvement over 4.8.5. I dropped GCC-patches from the CC list because I don't think 98% of the subscribers to that list care about VAX, while I know that 100% of the subscribers to this one do. ;-)
>> 
>> Excellent work, and yes, I suspect people here will care. :-)
>> 
>> By the way, among the "few remaining issues" I would include that NetBSD cannot build natively. gcc crashes out... I already posted about this.
>> 
>>> There are a few other smaller issues with GCC & binutils that I'm looking at cleaning up, but overall, I think NetBSD/vax is in as good shape as it has ever been. One thing I think would make sense as far as tidying up GCC's vax.md & vax.c is that I think most hobbyists are using CVAX and NVAX-based systems, or related, am I correct? I have a VAXstation 4000 VLC as well as the Model 90, and it seems like other people on the list have something from that era, as opposed to anything pre-MicroVAX. What I'd like to do is use the GMP harness to benchmark, in addition to GMP and MPFR themselves (yes, I want to update the tuning files that haven't been touched in 20 years in those libraries, too), the number of clock cycles of the different flavors of move, copy, compare, add, etc. on a CVAX and an NVAX system and then add compiler flags if the two are wildly different from each other.
>> 
>> I don't really agree that the shape is better than ever before. Not being able to build natively is a bad state, even if it has been like this for a couple of years now...
>> 
>> And I suspect I might be the only remaining person keeping an 8650 alive and running, so I can understand if it's not the default target, but please keep it as an option to target these older machines as well. (I have all instructions implemented in the hardware... :-) )
>> 
>> 	Johnny
>> 
>