cudaEvents offers a resolution of about 0.5 microseconds (according to the reference manual). Is there any way to measure more accurately (say, down to nanoseconds or exact GPU clock cycles) the time taken to execute code? Are there techniques to do this using the current simulators (Barra or GPGPU-Sim), or at the PTX or driver level?
Also, assuming most factors stay constant (same device, minimal OS noise, etc.): for a given kernel and input, does the kernel execute in the exact same number of GPU cycles on different runs? If not, what variation in cycle count should be expected?
The clock() function in device code is cycle-accurate. However, the compiler loves to move clock() calls all around the kernel (for example, hoisting them all to the beginning), no matter how many volatile specifiers you add to the variables.
Of course you can call __syncthreads() before and after clock() to prevent this, but it will add some non-constant number of cycles, which defeats the purpose of having a cycle-accurate clock.
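A minimal sketch of this pattern, bracketing the measured region with __syncthreads() and clock reads (the kernel name and the measured body are illustrative; clock64() is used instead of clock() to avoid 32-bit counter wraparound, and the barriers themselves add a non-constant number of cycles as noted above):

```cuda
#include <cstdio>

__global__ void timedKernel(float *data, long long *cycles)
{
    __syncthreads();                  // barrier to keep the read from drifting up
    long long start = clock64();
    __syncthreads();

    // --- region under measurement (illustrative work) ---
    data[threadIdx.x] *= 2.0f;
    // ----------------------------------------------------

    __syncthreads();                  // barrier to keep the read from drifting down
    long long stop = clock64();
    if (threadIdx.x == 0)
        *cycles = stop - start;       // includes the barriers' own overhead
}
```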
Academic simulators are not accurate enough (Barra doesn’t offer timing at this granularity, and GPGPU-Sim models a different architecture than NVIDIA’s).
The only really effective way to do clock-accurate measurements is to code them at the assembly/cubin level, using decuda and cudasm.
The main sources of noise are:

- Memory accesses, which might collide with a DRAM refresh or screen refresh (tens of cycles). This can be overcome by repeating each measurement a few times and keeping the minimum time.
- Boundaries between clock domains, which add a few cycles of random slack every time you cross one (memory transactions, block scheduling…).
- The instruction scheduler, whose scheduling policy is complex enough to cause chaotic behavior (a slight initial variation in timing can change the scheduling completely).
It is still a much less noisy environment than a CPU.
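The repeat-and-keep-the-minimum strategy can be sketched on the host side with cudaEvent timing (the kernel, launch configuration, and repetition count here are illustrative assumptions, not from the original post):

```cuda
#include <cfloat>
#include <cstdio>

__global__ void kernelUnderTest(float *d) { d[threadIdx.x] += 1.0f; }

// Run the kernel `reps` times and return the best (minimum) elapsed time in ms.
// Outliers caused by DRAM/screen refresh collisions are filtered out by the min.
float minElapsedMs(float *d, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float best = FLT_MAX;

    for (int i = 0; i < reps; ++i) {
        cudaEventRecord(start);
        kernelUnderTest<<<1, 256>>>(d);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);   // wait until the stop event has occurred
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best) best = ms;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best;
}
```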