https://www.cnblogs.com/bbqzsl/p/15510377.html
Recently, while optimizing, I benchmarked some functions on different compilers and platforms, and found a lot of problems.
Here is one of them.
typedef float MAFloat;

MAFloat sma(const MAFloat* seq, const int cnt, const int N, const int M)
{
    const MAFloat C1 = (MAFloat)M/N;
    const MAFloat C2 = (MAFloat)(N-M)/N;
    MAFloat result = 0.f;
    int total = cnt;

    #pragma nounroll
    for (int i = 0; i < total; ++i)
    {
        result = result * C2 + seq[i] * C1;
    }

    return result;
}
The test code is very simple: a single loop that does nothing but arithmetic, so the assembly code is also easy to read.
The test platforms are:
win10: vc120, gcc10, clang11
centos8: gcc8, gcc10, clang11
vc: options /arch:SSE2 /O2, targeting win32
gcc: options -ffast-math -O2 -m32
clang: options -ffast-math -O2 -m32
The array length is 28884 = 7221 * 4.
CPU: Core i5, 3.5 GHz
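The post does not show how the timings were taken. As a rough sketch only, a harness along the following lines could produce per-call times in milliseconds. The helper name time_sma_ms and the repeats parameter are hypothetical, clock() is only one of several possible timers, and #pragma nounroll is left out here so the block compiles without warnings on gcc.

```c
#include <time.h>

typedef float MAFloat;

/* Same recurrence as the function under test. */
MAFloat sma(const MAFloat *seq, const int cnt, const int N, const int M)
{
    const MAFloat C1 = (MAFloat)M / N;
    const MAFloat C2 = (MAFloat)(N - M) / N;
    MAFloat result = 0.f;
    for (int i = 0; i < cnt; ++i)
        result = result * C2 + seq[i] * C1;
    return result;
}

/* Hypothetical helper: average CPU time per call, in milliseconds,
 * measured over `repeats` calls with the standard clock() timer. */
double time_sma_ms(const MAFloat *seq, int cnt, int N, int M, int repeats)
{
    /* Store every result so the calls are not dropped as dead code.
     * An aggressive optimizer may still hoist the call out of the loop,
     * so it is worth checking the generated assembly. */
    volatile MAFloat sink = 0.f;
    const clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        sink = sma(seq, cnt, N, M);
    const clock_t stop = clock();
    (void)sink;
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

Averaging over many repeats matters here because a single call takes well under a millisecond, which is below the resolution of clock() on many systems.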
Test results:
win10: vc120 (0.06x ms), gcc10 (0.06x ms), clang11 (0.09x ms)
centos8: gcc8 (0.06x ms), gcc10 (0.06x ms), clang11 (0.09x ms)
On both win10 and CentOS 8, the code compiled by clang is about 50% slower than the code compiled by vc or gcc.
Now let's compare the assembly code produced by gcc10 and clang11.
## gcc
.L149:
    movss (%edx,%eax,4), %xmm1    # xmm1 = seq[i]
    mulss %xmm3, %xmm0            # xmm0 = result * C2
    addl $1, %eax                 #
    mulss %xmm2, %xmm1            # xmm1 = seq[i] * C1
    addss %xmm1, %xmm0            # result = xmm0 + xmm1
    cmpl %ecx, %eax
    jl .L149                      # next loop
## clang
LBB7_3:                           # =>This Inner Loop Header: Depth=1
    movss (%eax,%edx,4), %xmm4    # xmm4 = mem[0],zero,zero,zero
    mulss %xmm1, %xmm3            #
    incl %edx
    cmpl %ecx, %edx
    mulss %xmm0, %xmm4
    addss %xmm4, %xmm3
    mulss %xmm2, %xmm3            # xmm2 = 1/N;
    jl LBB7_3
The loop generated by gcc has 7 instructions, and the loop generated by clang has 8: one extra mulss.
For some reason, clang was too clever for its own good:
result * C2 + seq[i] * C1;
was optimized into
(1/N) * (result * (N-M) + seq[i] * M);
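To make the rewrite concrete, here is a minimal sketch of both forms as plain C loops. The function names sma_direct and sma_refactored and the locals inv_n, a and b are made up for illustration. The second loop is only a reconstruction of what the clang assembly appears to compute, not clang's actual intermediate code.

```c
#include <stddef.h>

/* Form written in the source: result * C2 + seq[i] * C1.
 * Per iteration: two multiplies and one add. */
float sma_direct(const float *seq, size_t cnt, int N, int M)
{
    const float C1 = (float)M / N;
    const float C2 = (float)(N - M) / N;
    float result = 0.f;
    for (size_t i = 0; i < cnt; ++i)
        result = result * C2 + seq[i] * C1;
    return result;
}

/* Form matching the clang loop: (1/N) * (result * (N-M) + seq[i] * M).
 * Per iteration: three multiplies and one add, and the final multiply
 * by 1/N has to wait for the add before it can start. */
float sma_refactored(const float *seq, size_t cnt, int N, int M)
{
    const float inv_n = 1.0f / N;      /* xmm2 in the clang listing */
    const float a = (float)(N - M);
    const float b = (float)M;
    float result = 0.f;
    for (size_t i = 0; i < cnt; ++i)
        result = inv_n * (result * a + seq[i] * b);
    return result;
}
```

The two forms are algebraically equal, but float rounding differs, so the results may not match in the last bits. That kind of reassociation is what -ffast-math allows the compiler to do.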
Even so, one extra mulss instruction should not make the code 50% slower. A 50% slowdown is what you would expect from going from 7 instructions to 10.5, not from 7 to 8.
Now let's analyze why.
My machine is an i5 at 3.5 GHz, so it runs 3.5 clock cycles in 1 ns.
The array length is 28884, so the loop body is executed 28884 times.
The run time in ns is 28884 * (cycles per loop iteration) / 3.5
If I roughly count each instruction as 1 cycle, the running time of the code generated by gcc is about 28884 * 7 / 3.5 = 57768ns, which matches the measured 0.06 ms. Estimated the same way, the running time of the code generated by clang is about 28884 * 8 / 3.5 = 66020ns.
However, different instructions have different latencies, that is, they take different numbers of cycles before their result is available. mulss takes 4 or 5 cycles, addss takes 3, and the other instructions in the assembly above take 1.
mulss %xmm2, %xmm1 # xmm1 = seq[i] * C1
addss %xmm1, %xmm0 # result = xmm0 + xmm1
In the two instructions above, addss depends on the mulss result in %xmm1, so addss cannot execute until mulss has finished, 4 or 5 cycles later. Thanks to the CPU's out-of-order execution, other instructions can run on other ALUs during those cycles. The code generated by gcc can therefore be treated as losing no cycles to stalls.
Let's look at the assembly code generated by clang
mulss %xmm0, %xmm4
addss %xmm4, %xmm3
mulss %xmm2, %xmm3 # xmm2 = 1/N;
Here addss depends on the mulss result in %xmm4, and the following mulss in turn depends on the addss result in %xmm3. Treat the first dependency like the one in the gcc code. The second dependency then has to wait the 3 cycles of addss. The loop has only 8 instructions, while the two dependencies together add up to about 8 cycles of latency, so out-of-order execution runs out of independent instructions to fill the gap and 3 or 4 cycles are spent waiting.
The running time then becomes 28884 * (8+3) / 3.5 = 90778ns.
The estimated results are basically consistent with the test results.
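The back-of-the-envelope arithmetic above can be reproduced with a few lines of C. This is only a sketch of the rough model used in this post (iterations times cycles per iteration, divided by cycles per nanosecond); the function name estimate_ns is made up for illustration.

```c
/* Rough model from the text: run time in nanoseconds equals
 * iterations * (cycles per loop iteration) / (clock cycles per ns).
 * A 3.5 GHz clock gives 3.5 cycles per ns. */
double estimate_ns(long iterations, double cycles_per_iter, double ghz)
{
    return (double)iterations * cycles_per_iter / ghz;
}
```

With the numbers from this post, estimate_ns(28884, 7, 3.5) gives 57768 for the gcc loop, estimate_ns(28884, 8, 3.5) gives about 66020.6 for the clang loop without stalls, and estimate_ns(28884, 8 + 3, 3.5) gives about 90778.3 once the 3 waiting cycles are added.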
Interested readers can inspect the assembly on godbolt. As soon as you let clang use the -ffast-math option, this silly thing happens.