Denver cores being slower than A57 cores

We have a test program executing a loop in a thread pinned to one CPU in this manner:

        #define ISB	asm volatile ("isb sy" : : : "memory")
        #define DSB	asm volatile ("dsb sy" : : : "memory")
        #define DMB	asm volatile ("dmb sy" : : : "memory")


        auto t1 = std::chrono::high_resolution_clock::now();
        uint64_t i;
        uint64_t sum = 0;
        uint64_t j = 10;
        uint64_t iterations = 0xfffffff;

        for(i= 0;i<iterations;i++) {
            // loop body under test goes here (the variants are listed further down)
        }

        auto t2 = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>( t2 - t1 ).count();
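For anyone who wants to reproduce the measurement, the fragment above can be wrapped into a small harness. This is only a sketch under assumptions: `pin_to_cpu` and `ns_per_iteration` are hypothetical helper names that are not in the original program, pinning is done with the Linux `sched_setaffinity` call, and `steady_clock` is swapped in for `high_resolution_clock` because it is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>
#ifdef __linux__
#include <sched.h>
#endif

// Hypothetical helper: pin the calling thread to one CPU.
// Linux-only; returns false on other platforms or if the call fails.
inline bool pin_to_cpu(int cpu) {
#ifdef __linux__
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;
#endif
}

// Hypothetical helper: run body(i) for i in [0, iterations) and return
// the average wall-clock time per iteration in nanoseconds.
// iterations must be non-zero.
template <typename Body>
double ns_per_iteration(std::uint64_t iterations, Body body) {
    auto t1 = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        body(i);
    }
    auto t2 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    return static_cast<double>(ns) / static_cast<double>(iterations);
}
```

A call could look like `pin_to_cpu(1); double ns = ns_per_iteration(0xfffffff, [&](std::uint64_t i) { sum += i * std::log(j); });`. Which CPU index is a Denver core and which is an A57 core is board-specific and not stated in this thread, so the `1` is a placeholder.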

Whether the Denver or the A57 core is faster depends on the operation. With this setup:

        float sum = 0;
        float j = 10;
        sum += i*std::log(j);

…the time per instruction on the Denver core is ~33 ns, while on the A57 it is ~23.80 ns.

If the following instructions are used instead:

        for(i= 0;i<iterations;i++) {
            sum += i*std::log(i);
        }

then the execution time on the Denver cores is ~8.76 ns, while on the A57 it is around 22.98 ns.
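One thing worth spelling out is that the two loops do not do the same work, so the two pairs of timings are not directly comparable. The sketch below is an illustration under assumptions, not the original test program: `variant_float` and `variant_double` are hypothetical names, the second variant uses a `double` accumulator, and it starts at `i = 1` because `std::log(0)` is negative infinity and `0 * -inf` is NaN.

```cpp
#include <cmath>
#include <cstdint>
#include <type_traits>

// A float argument selects the float overload of std::log.
static_assert(std::is_same<decltype(std::log(float{})), float>::value,
              "std::log(float) returns float");
// An integer argument is converted to double, so the double overload runs.
static_assert(std::is_same<decltype(std::log(std::uint64_t{})), double>::value,
              "std::log(integer) returns double");

// First variant: single precision, and log(j) does not depend on i,
// so a compiler is allowed to compute it once outside the loop.
float variant_float(std::uint64_t iterations, float j) {
    float sum = 0.0f;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sum += i * std::log(j);
    }
    return sum;
}

// Second variant: double precision, and log(i) must be evaluated
// on every iteration because its argument changes.
// Assumption: starts at 1 to avoid log(0); accumulator type is double.
double variant_double(std::uint64_t iterations) {
    double sum = 0.0;
    for (std::uint64_t i = 1; i < iterations; ++i) {
        sum += i * std::log(i);
    }
    return sum;
}
```

Timing both functions with the same accumulator type, and then again with a varying versus a constant argument, would separate the float-versus-double effect from the effect of the `log` call itself.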

  1. Can somebody explain why the performance is so bad on the Denver core when using floats?

  2. Sometimes, we can see a sudden improvement of the execution time on the Denver cores and they suddenly get faster than the ARM cores. I assume this is because of Dynamic Code Optimization? But why does that not occur on every execution of the same program (with the loop presented above, which should be a prime example to optimize), but rather just approximately one out of four times?

Floating-point log is a longer sequence however you implement it. GCC doesn't inline it, which means that the adds in the loop also aren't reassociated (even if you use -ffast-math).
Denver emulates ARM, and the microcode translation for floats is known to be not that great. The best thing you can try is compiling with clang.
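To make the reassociation point concrete: with a single accumulator, every add has to wait for the previous one. Reassociating means regrouping the adds so that several partial sums can progress independently, and that can also be done by hand. The following is a minimal sketch with hypothetical function names. It is not a claim that this fixes the Denver timings, since the `log` call is likely to dominate the loop.

```cpp
#include <cmath>
#include <cstdint>

// One term of the sum from the thread's second loop.
inline double term(std::uint64_t i) {
    return static_cast<double>(i) * std::log(static_cast<double>(i));
}

// Single accumulator: each add depends on the result of the previous add.
double sum_serial(std::uint64_t n) {
    double sum = 0.0;
    for (std::uint64_t i = 1; i < n; ++i) {
        sum += term(i);
    }
    return sum;
}

// Four independent accumulators: the adds are regrouped by hand, so the
// four dependency chains can overlap. The rounding differs slightly from
// the serial version because floating-point addition is not associative.
double sum_four_accumulators(std::uint64_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::uint64_t i = 1;
    for (; i + 3 < n; i += 4) {
        s0 += term(i);
        s1 += term(i + 1);
        s2 += term(i + 2);
        s3 += term(i + 3);
    }
    for (; i < n; ++i) {  // leftover iterations
        s0 += term(i);
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions start at `i = 1` to avoid `log(0)`; that starting index is a choice made for this sketch, not something taken from the original loop.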

Thanks for the insight, that makes sense! Is clang the better-suited compiler for this platform, then?

Is there a way to influence the DCO in any way? For example, by structuring the code in a special way, or through compiler flags or compiler attributes in the code? Is there documentation somewhere where I could learn more about this topic?