ARM Cortex-A72 slow multiply (MADD) instruction execution

To: port-arm%NetBSD.org@localhost
Subject: ARM Cortex-A72 slow multiply (MADD) instruction execution
From: Sad Clouds <cryintothebluesky%gmail.com@localhost>
Date: Wed, 15 Apr 2020 13:20:18 +0100

So I got myself an RPI-4 with a Cortex-A72 CPU, currently running
openSUSE Linux. I'm benchmarking hardware, so I don't think the type of
OS matters that much.

# dmesg | grep Machine
[    0.000000] Machine model: Raspberry Pi 4 Model B Rev 1.1

# lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
CPU max MHz:                     1500.0000
CPU min MHz:                     600.0000
BogoMIPS:                        108.00
NUMA node0 CPU(s):               0-3

I was a bit surprised to find that on this hardware the int64 multiply
instruction (MADD) is rather slow compared to int32, float and double.

I have some custom benchmarking code I wrote to do performance
testing. It runs a loop where it performs some arithmetic operations on
8 variables with no dependencies between them. The value "val" is marked
as "volatile" so that the compiler cannot optimise the operations away.
The idea is to measure throughput of CPU pipelines for add, subtract,
multiply, divide and modulo C operators. Below is a fragment of the
multiply function.

while (n != 0)
{
	n--;
	n1 *= val;
	n2 *= val;
	n3 *= val;
	n4 *= val;
	n5 *= val;
	n6 *= val;
	n7 *= val;
	n8 *= val;
}

n is some large value like 1 million.
n1 to n8 are initialized to 3.
val is initialized to 1.
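
For anyone wanting to reproduce this, a minimal self-contained sketch
of such a loop for int64 might look like the code below. The full
harness is not shown in this message, so the timing helper now_sec(),
the function name int64_mul_mops() and the "sink" parameter are
assumptions made for illustration only. Timing uses the C11
timespec_get() call, which is wall-clock time rather than a monotonic
clock.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not from the original harness): wall-clock seconds. */
static double now_sec(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Runs n iterations of 8 independent int64 multiply chains and returns
 * the measured throughput in Mops, counting each multiply as one
 * operation. The sum of the results is stored in *sink so the compiler
 * cannot discard the chains.
 */
double int64_mul_mops(uint64_t n, int64_t *sink)
{
	volatile int64_t val = 1;	/* volatile: reloaded before every multiply */
	int64_t n1 = 3, n2 = 3, n3 = 3, n4 = 3;
	int64_t n5 = 3, n6 = 3, n7 = 3, n8 = 3;
	const uint64_t iters = n;
	double t0, t1;

	t0 = now_sec();
	while (n != 0)
	{
		n--;
		n1 *= val;
		n2 *= val;
		n3 *= val;
		n4 *= val;
		n5 *= val;
		n6 *= val;
		n7 *= val;
		n8 *= val;
	}
	t1 = now_sec();

	*sink = n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8;
	return (double)iters * 8.0 / ((t1 - t0) * 1e6);
}
```

Since val is 1, every chain stays at 3, so *sink should come out as 24
regardless of n. A very small n may give a meaningless figure, because
the elapsed time can be too short to measure.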

Below are test results for Intel and ARM CPUs in Mops (mega operations
per second). Note how Cortex-A72 int64 multiply throughput is about
one third of int32 multiply.

Xeon E5620:
GCC 8.3.0. CFLAGS="-O3 -Wall -pedantic -std=c11 -march=native"
Mul (Mops): int32=2390.08, int64=2388.03, flt=2328.10, dbl=2389.01, ldbl=645.95

Cortex-A72:
GCC 9.3.1. CFLAGS="-O3 -Wall -pedantic -std=c11 -mcpu=cortex-a72"
Mul (Mops): int32=1498.49, int64=499.47, flt=1498.78, dbl=1497.74, ldbl=15.28
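
To put the Cortex-A72 numbers in terms of clock cycles, the small
helper below divides the clock frequency by the measured throughput.
It assumes that the core was actually running at the 1500 MHz maximum
reported by lscpu during the test, and that each multiply counts as one
operation. The function name cycles_per_op() is made up for this
illustration.

```c
/*
 * Average clock cycles spent per operation, given throughput in Mops
 * and clock frequency in MHz. Assumes the clock stayed at "mhz" for
 * the whole measurement.
 */
double cycles_per_op(double mops, double mhz)
{
	return mhz / mops;
}
```

Under those assumptions, cycles_per_op(1498.49, 1500.0) is roughly 1.0
for int32 and cycles_per_op(499.47, 1500.0) is roughly 3.0 for int64,
i.e. about one multiply per cycle versus one every three cycles.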

Looking at the output of "objdump -d -M no-aliases", the assembly code
is very similar for Cortex-A72 int32 and int64, except for the different
registers, Wn vs Xn.

Assembly for int64 multiply loop:
  40451c:       fd000380        str     d0, [x28]
  404520:       aa1303e0        orr     x0, xzr, x19
  404524:       34000274        cbz     w20, 404570 <int64_mul+0xa0>
  404528:       f94037e2        ldr     x2, [sp, #104]
  40452c:       71000694        subs    w20, w20, #0x1
  404530:       f94037e1        ldr     x1, [sp, #104]
  404534:       f94037e3        ldr     x3, [sp, #104]
  404538:       9b027e73        madd    x19, x19, x2, xzr
  40453c:       f94037e2        ldr     x2, [sp, #104]
  404540:       9b017c00        madd    x0, x0, x1, xzr
  404544:       f94037e1        ldr     x1, [sp, #104]
  404548:       9b037f5a        madd    x26, x26, x3, xzr
  40454c:       f94037e3        ldr     x3, [sp, #104]
  404550:       9b027f39        madd    x25, x25, x2, xzr
  404554:       f94037e2        ldr     x2, [sp, #104]
  404558:       9b017f18        madd    x24, x24, x1, xzr
  40455c:       f94037e1        ldr     x1, [sp, #104]
  404560:       9b037ef7        madd    x23, x23, x3, xzr
  404564:       9b027ed6        madd    x22, x22, x2, xzr
  404568:       9b017eb5        madd    x21, x21, x1, xzr
  40456c:       54fffde1        b.ne    404528 <int64_mul+0x58>  // b.any
  404570:       f90033e0        str     x0, [sp, #96]

I don't know much about ARM architecture. Is it generally well known
that int64 multiplication is rather slow on aarch64? If not, then
could it be down to compiler optimizations? Looking at the assembly
code, though, int32 and int64 look virtually identical. Could it be down
to the "ldr Xn" instructions needing more bandwidth to load 64-bit
registers? But then this doesn't seem to be an issue for the double data
type. Any other suggestions?
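
One way to check the "ldr Xn" theory would be a variant of the loop
that keeps val in a register, so no loads are issued inside the loop.
The sketch below is only a suggestion for such an experiment, not code
from the benchmark. It relies on the GCC/Clang inline assembly
extension: the empty asm statement with a "+r" constraint makes the
compiler treat val as changed on every iteration without emitting any
instruction. The function name int64_mul_noload() is hypothetical.

```c
#include <stdint.h>

/*
 * Same 8 independent int64 multiply chains, but val is a plain
 * register value instead of a volatile memory location. The empty
 * asm (GCC/Clang extension) hides the value of val from the
 * optimiser, so the multiplies cannot be folded away, yet no load
 * is needed. Returns the sum of the 8 chains.
 */
int64_t int64_mul_noload(uint64_t n, int64_t val)
{
	int64_t n1 = 3, n2 = 3, n3 = 3, n4 = 3;
	int64_t n5 = 3, n6 = 3, n7 = 3, n8 = 3;

	while (n != 0)
	{
		n--;
		__asm__ volatile("" : "+r"(val));
		n1 *= val;
		n2 *= val;
		n3 *= val;
		n4 *= val;
		n5 *= val;
		n6 *= val;
		n7 *= val;
		n8 *= val;
	}
	return n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8;
}
```

If timing this variant still shows about one third of the int32
throughput, the loads are unlikely to be the cause. It is worth
checking with objdump that the loop really contains no ldr
instructions before drawing conclusions.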
