Re: -falign-functions=16 for i386/amd64

To: matthew green <mrg%eterna.com.au@localhost>
Subject: Re: -falign-functions=16 for i386/amd64
From: Ryota Ozaki <ozaki-r%netbsd.org@localhost>
Date: Mon, 5 Sep 2016 18:28:42 +0900

On Mon, Sep 5, 2016 at 1:19 PM, matthew green <mrg%eterna.com.au@localhost> wrote:
> Ryota Ozaki writes:
>> On Thu, Sep 1, 2016 at 4:04 PM, matthew green <mrg%eterna.com.au@localhost> wrote:
>> > have you tested other values than 1 and 16?  what about 4 or 8?
>>
>> 4 and 8 are not so good; their performance fluctuations are
>> similar to the unaligned case in my experiments.
>>
>> >
>> > can you post the size difference of kernels?  particularly the
>> > kernel without DIAGNOSTIC or DEBUG (since those are the ones
>> > where performance matters most.)
>>
>> I measured the sizes of GENERIC kernels, i.e., DIAGNOSTIC on
>> and DEBUG off.
>
> DIAGNOSTIC is enabled on most -current GENERIC kernels including
> the amd64 one.  it's disabled on release branches.

I tried without DIAGNOSTIC. The overhead due to alignment doesn't
change, but the total text size of the kernel shrinks by 660kB,
so the ratio of overhead increases a bit (< 1%).
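
A rough sketch of that arithmetic, for reference. The exact total text
size is not given in this thread; the 20000 kB figure below is an
assumption back-derived from "200kB is 1%", and the function name is
made up for illustration.

```python
def overhead_ratio(padding_kb, text_kb):
    """Fraction of the kernel text taken up by alignment padding."""
    return padding_kb / text_kb


# Assumption: about 20000 kB of text, inferred from "200kB is 1%".
TEXT_WITH_DIAGNOSTIC_KB = 20000.0
PADDING_KB = 200.0            # extra text with -falign-functions=16
DIAGNOSTIC_SAVING_KB = 660.0  # text saved by disabling DIAGNOSTIC

with_diag = overhead_ratio(PADDING_KB, TEXT_WITH_DIAGNOSTIC_KB)
without_diag = overhead_ratio(
    PADDING_KB, TEXT_WITH_DIAGNOSTIC_KB - DIAGNOSTIC_SAVING_KB
)

# The padding stays the same while the text shrinks, so the ratio rises.
print(f"with DIAGNOSTIC:    {with_diag:.2%}")
print(f"without DIAGNOSTIC: {without_diag:.2%}")
```

Under that assumed total, the ratio moves only slightly above 1%.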

>
>> The sizes of kernel binaries don't change in most cases because
>> the alignment of __rodata_start that begins just after kernel text
>> hides the changes due to -falign-functions.
>>
>> The size of the actual kernel text (from kernel_text to _etext)
>> changes slightly. The difference between GENERIC kernels built
>> w/ and w/o -falign-functions=16 is 200kB. That is 1% of the total
>> kernel text size.
>>
>> BTW, as I noted, I'm not looking for the alignment size that gives
>> the best performance; I just want to reduce performance fluctuations.
>
> 200KB is a lot of text.  that's a non trivial i-cache issue.
>
> what are the CPU specifics of the system you're testing on?

dut1# cpuctl identify 0
cpu0: highest basic info 0000000b
cpu0: highest extended info 80000008
cpu0: "Intel(R) Atom(TM) CPU  C2558  @ 2.40GHz"
cpu0: Intel Atom C2000 (686-class), 2400.27 MHz
cpu0: family 0x6 model 0x4d stepping 0x8 (id 0x406d8)
cpu0: features 0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE>
cpu0: features 0xbfebfbff<MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2>
cpu0: features 0xbfebfbff<SS,HTT,TM,SBF>
cpu0: features1 0x43d8e3bf<SSE3,PCLMULQDQ,DTES64,MONITOR,DS-CPL,VMX,EST,TM2>
cpu0: features1 0x43d8e3bf<SSSE3,CX16,xTPR,PDCM,SSE41,SSE42,MOVBE,POPCNT>
cpu0: features1 0x43d8e3bf<DEADLINE,AES,RDRAND>
cpu0: features2 0x28100800<SYSCALL/SYSRET,XD,RDTSCP,EM64T>
cpu0: features3 0x101<LAHF,PREFETCHW>
cpu0: I-cache 32KB 64B/line 8-way, D-cache 24KB 64B/line 6-way
cpu0: L2 cache 1MB 64B/line 16-way
cpu0: ITLB 48 4KB entries fully associative
cpu0: DTLB 128 4KB entries 4-way, 4K/2M: 16 entries
cpu0: Initial APIC ID 0
cpu0: Cluster/Package ID 0
cpu0: Core ID 0
cpu0: SMT ID 0
cpu0: DSPM-eax 0x5<DTS,ARAT>
cpu0: DSPM-ecx 0x9<HWF,EPB>
cpu0: SEF highest subleaf 00000000
cpu0: SEF-main 0x2282<TSCADJUST,SMEP,ERMS,FPUCSDS>
cpu0: microcode version 0x127, platform ID 0
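
To put the I-cache numbers above in context, here is a small sketch of
the geometry and of how much padding a 16-byte function alignment can
add. The helper names are hypothetical, and treating "200kB" as
200 * 1024 bytes is an assumption.

```python
def num_sets(cache_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return cache_bytes // (line_bytes * ways)


def padding_needed(addr, align):
    """Bytes of padding needed to round addr up to a multiple of align."""
    return (-addr) % align


ICACHE_BYTES = 32 * 1024  # I-cache 32KB (from the cpuctl output)
LINE_BYTES = 64           # 64B/line
WAYS = 8                  # 8-way
ALIGN = 16                # -falign-functions=16

icache_sets = num_sets(ICACHE_BYTES, LINE_BYTES, WAYS)
starts_per_line = LINE_BYTES // ALIGN  # aligned function starts per line

# Assumption: "200kB" of extra text means 200 * 1024 bytes.
extra_text_vs_icache = (200 * 1024) / ICACHE_BYTES

print(icache_sets, starts_per_line, extra_text_vs_icache)
print(padding_needed(0x1003, ALIGN))  # function would otherwise start at 0x1003
```

Each function start can cost at most ALIGN - 1 bytes of padding, which
is where the extra text comes from.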


> can you run performance tests on systems with small cache?

I haven't tested that yet. It will take some time because I don't
have a suitable machine. BTW, what cache size do you consider small?

Thanks,
  ozaki-r
