Calls to clock() are not in the expected order

Hi all,

Recently I have been writing code to measure the latency of some instructions at the clock-cycle level. We are trying to understand the latency of register shuffles compared with shared memory access. I want to understand the performance of the function “__shfl()” and how the intra-warp access pattern affects its latency. (For example, a thread reading data from a neighbouring thread could be faster than reading data from a far-away thread.)

I then wrote the code in the following way:

clock_t start = clock();
sum *= __shfl(sum, threadIdx.x + 1);
… (Process this instruction several times)
clock_t stop = clock();

… Calculate time difference here.

But then I got strange results, which motivated me to read the assembly code that “cuobjdump” generates from the executable. I was surprised to find that the calls to “clock()” are not in the positions I expected.
The second clock() call was placed just one statement after the first call, which results in only 1 clock cycle being reported every time.

I hope someone has encountered this problem and can help me fix it. I guess it is caused by compiler optimisation in some way.

Best Regards

can’t “volatile” help as usual?


I tried it but it didn’t help. I am also confused about what happened.

I can answer the second question: optimizing compilers feel free to reorder operations, drop useless computations, and so on.


I agree with your points. I am confused about why the compiler is allowed to reorder the calls that take the time stamps.
Furthermore, clock() measures time at the cycle level, so the position of these calls is quite important.

Best Regards,

An optimizing compiler may move code in the absence of dependencies, and the second call to clock() here has neither a data nor a control dependency on any of the preceding computation. Compare this similar question on Stack Overflow about timing of CPU code (
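
To illustrate the dependency point, here is a minimal sketch of one way to give the second time stamp something to wait for. It is not the code from this thread. It assumes CUDA 9 or later, so it uses `__shfl_sync()` in place of the older `__shfl()`, and the names `read_clock64`, `shfl_latency`, `measure_shfl_cycles`, `kIters` and `sink` are made up for the example. The idea is that each shuffle consumes the previous result, the final result is stored to shared memory, and the time stamps are read with `asm volatile` plus a "memory" clobber so the compiler should not move that store across them. The hardware scheduler and ptxas still have some freedom, so it is worth confirming the instruction order with cuobjdump before trusting the numbers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kIters = 256;  // hypothetical repeat count for the example

// Read the 64-bit cycle counter. "volatile" keeps the read from being
// removed or merged, and the "memory" clobber tells the compiler not to
// move memory accesses across this statement.
__device__ __forceinline__ unsigned long long read_clock64()
{
    unsigned long long t;
    asm volatile("mov.u64 %0, %%clock64;" : "=l"(t) :: "memory");
    return t;
}

__global__ void shfl_latency(float seed, float *out, long long *cycles)
{
    __shared__ float sink[32];

    float sum = seed;                   // runtime value, so nothing can be folded
    int src = (threadIdx.x + 1) & 31;   // read from the next lane, wrapping around

    unsigned long long start = read_clock64();
    for (int i = 0; i < kIters; ++i) {
        // Each shuffle uses the previous result: a dependent chain.
        sum *= __shfl_sync(0xffffffffu, sum, src);
    }
    // The store needs the final value of sum, and the "memory" clobber in
    // read_clock64() keeps the store in front of the second time stamp.
    sink[threadIdx.x] = sum;
    unsigned long long stop = read_clock64();

    out[threadIdx.x] = sink[threadIdx.x];             // keep the result live
    cycles[threadIdx.x] = (long long)(stop - start);
}

// Host helper: launch one warp and return the cycle count seen by lane 0.
long long measure_shfl_cycles()
{
    float *d_out = nullptr;
    long long *d_cycles = nullptr;
    long long h_cycles[32] = {0};

    cudaMalloc(&d_out, 32 * sizeof(float));
    cudaMalloc(&d_cycles, 32 * sizeof(long long));
    shfl_latency<<<1, 32>>>(1.0f, d_out, d_cycles);
    cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_cycles);
    return h_cycles[0];
}

int main()
{
    long long c = measure_shfl_cycles();
    printf("%lld cycles for %d dependent shuffles (about %.1f each)\n",
           c, kIters, (double)c / kIters);
    return 0;
}
```

The measured interval also includes the loop overhead, the multiplies and the issue of the shared memory store, so a fair per-shuffle figure would need a baseline run (the same kernel without the shuffle) subtracted from it.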