Port-arm archive
ARM Cortex-A72 slow multiply (MADD) instruction execution
So I got myself a Raspberry Pi 4 with a Cortex-A72 CPU, currently running
openSUSE Linux. I'm benchmarking the hardware, so I don't think the choice
of OS matters that much.
# dmesg | grep Machine
[ 0.000000] Machine model: Raspberry Pi 4 Model B Rev 1.1
# lscpu
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
NUMA node0 CPU(s):   0-3
I was a bit surprised to find that on this hardware the int64 multiply
instruction (MADD) is rather slow compared to int32, float and double
multiplies.
I wrote some custom benchmarking code for performance testing. It runs
a loop that performs arithmetic operations on 8 variables with no
dependencies between them. The operand "val" is declared "volatile" so
that the compiler cannot optimise the operations away. The idea is to
measure the throughput of the CPU pipelines for the add, subtract,
multiply, divide and modulo C operators. Below is the loop from the
multiply function.
while (n != 0)
{
    n--;
    n1 *= val;
    n2 *= val;
    n3 *= val;
    n4 *= val;
    n5 *= val;
    n6 *= val;
    n7 *= val;
    n8 *= val;
}
n is some large value like 1 million.
n1 to n8 are initialized to 3.
val is initialized to 1.
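For anyone who wants to reproduce this, here is a minimal sketch of how the int64 case might be wrapped into a complete function. It is not the actual benchmark source: the function names int64_mul_loop and int64_mul_mops, the use of clock() for timing and the summing of the accumulators at the end are all assumptions made for illustration.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical wrapper around the loop above: multiplies eight
 * independent int64 accumulators by a volatile operand, n times. */
int64_t int64_mul_loop(uint32_t n, int64_t operand)
{
    /* volatile forces a fresh load of val for every multiply */
    volatile int64_t val = operand;
    int64_t n1 = 3, n2 = 3, n3 = 3, n4 = 3, n5 = 3, n6 = 3, n7 = 3, n8 = 3;

    while (n != 0) {
        n--;
        n1 *= val;
        n2 *= val;
        n3 *= val;
        n4 *= val;
        n5 *= val;
        n6 *= val;
        n7 *= val;
        n8 *= val;
    }
    /* Return a value that depends on every accumulator so that
     * none of the multiplies can be dropped as dead code. */
    return n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8;
}

/* Hypothetical timing helper: runs the loop with val = 1 and returns
 * the throughput in Mops (8 multiplies per iteration). */
double int64_mul_mops(uint32_t n)
{
    clock_t start = clock();
    volatile int64_t sink = int64_mul_loop(n, 1);
    clock_t stop = clock();
    (void)sink;

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0; /* run too short for the clock resolution */
    return 8.0 * (double)n / seconds / 1e6;
}
```

With n around 1 million the run may be too short for clock() to resolve on some systems, so a larger n or repeated runs would probably be needed to get stable numbers.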
Below are test results for Intel and ARM CPUs in Mops (mega-operations
per second). Note how the Cortex-A72 int64 multiply throughput is about
1/3 of the int32 multiply throughput.
Xeon E5620:
GCC 8.3.0. CFLAGS="-O3 -Wall -pedantic -std=c11 -march=native"
Mul (Mops): int32=2390.08, int64=2388.03, flt=2328.10, dbl=2389.01, ldbl=645.95
Cortex-A72:
GCC 9.3.1. CFLAGS="-O3 -Wall -pedantic -std=c11 -mcpu=cortex-a72"
Mul (Mops): int32=1498.49, int64=499.47, flt=1498.78, dbl=1497.74, ldbl=15.28
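To put the Cortex-A72 numbers in terms of clock cycles, a small helper like the one below can be used. It is only a sketch: cycles_per_op is a made-up name, and it assumes the core was actually running at the 1500 MHz maximum reported by lscpu for the whole run, which was not measured.

```c
/* Hypothetical helper: average clock cycles per operation, given the
 * core clock in MHz and the measured throughput in Mops.
 * Assumes the clock stayed at clock_mhz for the whole benchmark. */
double cycles_per_op(double clock_mhz, double mops)
{
    return clock_mhz / mops;
}
```

Under that assumption, cycles_per_op(1500.0, 1498.49) comes out at roughly 1.0 and cycles_per_op(1500.0, 499.47) at roughly 3.0, i.e. about one int32 multiply per cycle versus one int64 multiply every three cycles.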
Looking at the output of "objdump -d -M no-aliases", the assembly code
for the Cortex-A72 int32 and int64 loops is very similar, apart from the
use of Wn vs Xn registers. (With no-aliases, a plain MUL is shown as
MADD with XZR as the addend.)
Assembly for int64 multiply loop:
40451c: fd000380 str d0, [x28]
404520: aa1303e0 orr x0, xzr, x19
404524: 34000274 cbz w20, 404570 <int64_mul+0xa0>
404528: f94037e2 ldr x2, [sp, #104]
40452c: 71000694 subs w20, w20, #0x1
404530: f94037e1 ldr x1, [sp, #104]
404534: f94037e3 ldr x3, [sp, #104]
404538: 9b027e73 madd x19, x19, x2, xzr
40453c: f94037e2 ldr x2, [sp, #104]
404540: 9b017c00 madd x0, x0, x1, xzr
404544: f94037e1 ldr x1, [sp, #104]
404548: 9b037f5a madd x26, x26, x3, xzr
40454c: f94037e3 ldr x3, [sp, #104]
404550: 9b027f39 madd x25, x25, x2, xzr
404554: f94037e2 ldr x2, [sp, #104]
404558: 9b017f18 madd x24, x24, x1, xzr
40455c: f94037e1 ldr x1, [sp, #104]
404560: 9b037ef7 madd x23, x23, x3, xzr
404564: 9b027ed6 madd x22, x22, x2, xzr
404568: 9b017eb5 madd x21, x21, x1, xzr
40456c: 54fffde1 b.ne 404528 <int64_mul+0x58> // b.any
404570: f90033e0 str x0, [sp, #96]
I don't know much about the ARM architecture. Is it generally well known
that int64 multiplication is rather slow on aarch64? If not, could it
be down to compiler optimisations? Looking at the assembly code, though,
the int32 and int64 loops look virtually identical. Could it be that
the "ldr Xn" instructions need more bandwidth to load 64-bit registers?
But then that doesn't seem to be an issue for the double data type.
Any other suggestions?