A performance "murder mystery" caused by an instruction optimization: clang gets 50% slower with the -ffast-math option

Keywords: C++, assembly language, optimization

Recently, while optimizing, I benchmarked some functions across different compiler platforms and found quite a few problems.

Here I'll share one of them.

    typedef float MAFloat;

    MAFloat sma(const MAFloat* seq, const int cnt, const int N, const int M)
    {
        const MAFloat C1 = (MAFloat)M/N;
        const MAFloat C2 = (MAFloat)(N-M)/N;
        MAFloat result = 0.f;
        int total = cnt;

    #pragma nounroll
        for (int i = 0; i < total; ++i)
        {
            result = result * C2 + seq[i] * C1;
        }

        return result;
    }

The test code is very simple: a single loop that does only arithmetic, so the generated assembly is also easy to read.
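For reference, a benchmark harness along these lines reproduces the setup (a sketch, not the author's exact code; the `bench_sma_ms` helper and the `N`/`M` parameter values are my own illustration, and the `#pragma nounroll` is omitted for portability):

```cpp
#include <chrono>
#include <vector>

typedef float MAFloat;

// Same kernel as in the article: an exponentially weighted moving average.
MAFloat sma(const MAFloat* seq, const int cnt, const int N, const int M)
{
    const MAFloat C1 = (MAFloat)M / N;
    const MAFloat C2 = (MAFloat)(N - M) / N;
    MAFloat result = 0.f;
    for (int i = 0; i < cnt; ++i)
        result = result * C2 + seq[i] * C1;
    return result;
}

// Times one call of sma() over an array of the length used in the article
// (28884 elements) and returns the elapsed time in milliseconds.
double bench_sma_ms(int len, int N, int M)
{
    std::vector<MAFloat> seq(len, 1.0f);
    auto t0 = std::chrono::steady_clock::now();
    volatile MAFloat r = sma(seq.data(), len, N, M);  // volatile: keep the call
    (void)r;
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

In practice you would call `bench_sma_ms` many times and take the minimum, since a single 28884-element pass takes well under a millisecond.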

The test platforms were:

win10: vc120, gcc10, clang11

centos8: gcc8, gcc10, clang11

vc: options /arch:sse2 /O2, building for win32

gcc: options -ffast-math -O2 -m32

clang: options -ffast-math -O2 -m32

The array length is 28884 = 7221 * 4.

CPU: Core i5, 3.5 GHz

Test results:

win10: vc120 (0.06x ms), gcc10 (0.06x ms), clang11 (0.09x ms)

centos8: gcc8 (0.06x ms), gcc10 (0.06x ms), clang11 (0.09x ms)

Whether on win10 or centos8, the code compiled by clang runs about 50% slower than the code compiled by vc or gcc.

Now let's compare the assembly code produced by gcc10 and clang11.

## gcc 

    .L149:
        movss    (%edx,%eax,4), %xmm1    # xmm1 = seq[i]
        mulss    %xmm3, %xmm0            # xmm0 = result * C2
        addl     $1, %eax                # ++i
        mulss    %xmm2, %xmm1            # xmm1 = seq[i] * C1
        addss    %xmm1, %xmm0            # result = xmm0 + xmm1
        cmpl     %ecx, %eax
        jl       .L149                   # next iteration

## clang

    LBB7_3:                              # =>This Inner Loop Header: Depth=1
        movss    (%eax,%edx,4), %xmm4    # xmm4 = seq[i]
        mulss    %xmm1, %xmm3            # xmm3 = result * (N-M)
        incl     %edx
        cmpl     %ecx, %edx
        mulss    %xmm0, %xmm4            # xmm4 = seq[i] * M
        addss    %xmm4, %xmm3            # xmm3 = xmm3 + xmm4
        mulss    %xmm2, %xmm3            # xmm2 = 1/N
        jl       LBB7_3

The loop generated by gcc has 7 instructions in total; the loop generated by clang has 8, one extra mulss.

For some reason, clang "cleverly" rewrote

result * C2 + seq[i] * C1

into

(1/N) * (result * (N-M) + seq[i] * M)

But one extra mulss by itself should not cost 50%: a 50% slowdown would be the difference between 7 instructions and 10.5, not between 7 and 8.

Now let's analyze.

My machine is an i5 at 3.5 GHz, so it runs about 3.5 clock cycles per nanosecond.

The array length is 28884, so the loop body executes 28884 times.

Run time ≈ 28884 * (cycles per iteration) / 3.5 ns

Treating each instruction as roughly one cycle, the gcc code should take about 28884 * 7 / 3.5 = 57768 ns, which matches the measured 0.06x ms. By the same estimate, the clang code should take about 28884 * 8 / 3.5 ≈ 66020 ns.
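The back-of-the-envelope arithmetic can be captured in a tiny helper (a sketch of the estimate used in the text; `estimate_ns` is my own name):

```cpp
// Naive run-time estimate: iterations * cycles-per-iteration / (cycles per ns),
// treating every instruction in the loop body as one cycle.
double estimate_ns(int iterations, int cycles_per_iter, double ghz)
{
    return iterations * cycles_per_iter / ghz;
}
```

For example, `estimate_ns(28884, 7, 3.5)` gives the gcc figure of 57768 ns.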

However, different instructions have different latencies: mulss takes 4 or 5 cycles, addss takes 3, and the other instructions in the loops above take 1.

    mulss    %xmm2, %xmm1         # xmm1 = seq[i] * C1 
    addss    %xmm1, %xmm0         # result = xmm0 + xmm1

In the two instructions above, addss depends on the mulss result in %xmm1, so it cannot start until 4 or 5 cycles after the mulss issues. Thanks to the CPU's out-of-order execution, other independent instructions can run on other ALUs during those latency cycles, so the gcc loop effectively loses no cycles.
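The cost of such a serial dependency chain can be illustrated with a toy pair of loops (a sketch; the function names are mine, and exact timings depend on the CPU). The first runs at mulss *latency*, because every multiply waits for the previous one; the second keeps four independent chains in flight, so the out-of-order core can overlap them and approach mulss *throughput*:

```cpp
#include <cstddef>

// One serial chain: each multiply depends on the previous result,
// so the loop is bound by multiply latency (~4-5 cycles per element).
float chained_product(const float* v, std::size_t n)
{
    float p = 1.0f;
    for (std::size_t i = 0; i < n; ++i)
        p *= v[i];
    return p;
}

// Four independent chains: the out-of-order core overlaps them,
// so the loop approaches multiply throughput (~1 per cycle).
float split_product(const float* v, std::size_t n)
{
    float p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        p0 *= v[i];
        p1 *= v[i + 1];
        p2 *= v[i + 2];
        p3 *= v[i + 3];
    }
    for (; i < n; ++i)   // leftover elements
        p0 *= v[i];
    return (p0 * p1) * (p2 * p3);
}
```

The sma() recurrence itself cannot be split this way without changing its result, which is exactly why the length of its per-iteration dependency chain matters so much.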

Let's look at the assembly code generated by clang

    mulss    %xmm0, %xmm4    
    addss    %xmm4, %xmm3    
    mulss    %xmm2, %xmm3           # xmm2 = 1/N;

Here addss depends on the mulss result in %xmm4, and the final mulss in turn depends on the addss result in %xmm3. If we treat the first dependency as equivalent to the one in the gcc loop, the second dependency still has to wait about 3 cycles for the addss to finish. The loop has only 8 instructions, while the two dependency chains together account for about 8 cycles of latency, so out-of-order execution runs out of independent work and each iteration stalls for roughly 3 extra cycles.

The estimated run time then becomes 28884 * (8+3) / 3.5 ≈ 90778 ns.

The estimated results are basically consistent with the test results.

If you're interested, you can inspect the assembly yourself on Godbolt (Compiler Explorer). As soon as you give clang the -ffast-math option, this regression appears.


Posted by LordPsyan on Thu, 04 Nov 2021 06:24:53 -0700