We have a test program executing a loop in a thread pinned to one CPU in this manner:
#include <chrono>
#include <cstdint>
#include <cmath>   // std::log, used in the <OP> variants below

// AArch64 barrier macros to serialize the pipeline around the timed region
#define ISB asm volatile ("isb sy" : : : "memory")
#define DSB asm volatile ("dsb sy" : : : "memory")
#define DMB asm volatile ("dmb sy" : : : "memory")

DSB;
ISB;
auto t1 = std::chrono::high_resolution_clock::now();

uint64_t i;
uint64_t sum = 0;
uint64_t j = 10;
uint64_t iterations = 0xfffffff;

for (i = 0; i < iterations; i++) {
    <OP>
    j++;
}

DSB;
ISB;
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
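The pinning itself is not shown in the snippet; a minimal sketch of how it can be done (assuming Linux and pthread_setaffinity_np; the CPU index is only an illustration) looks like this:

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single logical CPU (e.g. a Denver or an A57 core).
void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}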
Whether the Denver or the A57 core is faster depends on the operation. With this setup:
float sum = 0;
float j = 10;
...
sum += i*std::log(j);
…the time per iteration on the Denver core is ~33 ns, while on the A57 it is ~23.80 ns.
With the following loop instead:
for (i = 0; i < iterations; i++) {
    sum += i * std::log(i);
}
then the time per iteration on the Denver core is ~8.76 ns, while on the A57 it is around 22.98 ns.
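For reference, the per-iteration figures are obtained roughly like this (duration and iterations are the variables from the snippet above):

// Average time per loop iteration in nanoseconds.
double ns_per_iteration = static_cast<double>(duration) / iterations;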
-
Can somebody explain why the performance is so bad on the Denver core when using floats?
-
Sometimes we see a sudden improvement in execution time on the Denver cores, and they become faster than the ARM cores. I assume this is caused by Dynamic Code Optimization? But why does it not happen on every execution of the same program (the loop above should be a prime candidate for optimization), but rather only on roughly one out of four runs?