Precision of events for recording time elapsed of a kernel

I wish to understand better the precision in calculating time of a kernel duration. In http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g40159125411db92c835edb46a0989cd6 it is given the impression that a resolution of 5 microseconds is possible, but the below simple example is giving me different results when compared to nvvp

#include <stdio.h>
__global__ void test(
        int * pos, int _repeat
        )
{
        for (int repeat=0;repeat<_repeat;repeat++)
        {
                        atomicAdd(pos+repeat,repeat);

        }

}
main()
{
        int repeat=1000;
        int * pos;
        cudaMalloc(&pos,sizeof(int)*repeat);
        dim3 threadDim;
        threadDim.x=32;
        threadDim.y=32;
        threadDim.z=1;
        cudaEvent_t begin;
        cudaEvent_t end;
        cudaEventCreateWithFlags(&begin,cudaEventBlockingSync);
        cudaEventCreateWithFlags(&end,cudaEventBlockingSync);
        cudaEventRecord(begin);
        test<<<500,threadDim>>>(pos,repeat);
        cudaEventRecord(end);
        cudaDeviceSynchronize();
        float elapsed;
        cudaEventElapsedTime (&elapsed,begin,end);
        printf ("%f\n",elapsed);
        exit(0);

}

When running this code on K40 nvvp reports a kernel duration of 3.444ms, but the events output give 3.58ms a difference of more the 100microseconds.
Are events so in-precise to calculate time or nvvp is not suitable to make such calculations

These observations are unlikely indicative of a lack of timing resolution (resolution in the microsecond range is hihly probable), but due to other factors. For example, what exactly is being measured, how it is measured (steady state?), magnitude of measurement noise level (what kind of spread do you observe when you run the same kernel repeatedly?)

3.58 ms vs 3.444 ms is a difference of 4%, barely more than measurement noise (by default I assume 2%). This small differences could be (partially) due to the fact that the profiler injects measurement points at the lowest possible software stack level, and the difference with your measurements may therefore be (partially) attributable to the overhead of the additional software layers. Host-side overhead is generally minimized by using a CPU with high single-thread performance.

What is the specific use case that you are trying to address that requires you to determine run time more accurately than within 4%? Natural variation in run-time, especially for memory intensive kernels, may already be greater than that. Run time can also vary due to variable clocks due to device temperature. Are you using application clocks to lock in particular device clocks for the duration of profiling activity?

I recall multiple posts from a knowledgeable NVIDIA engineer in the past warning against the use of CUDA events for precise timing and instead recommending the use of high-precision host-side timers. I do not recall the reasoning behind this advice. Typical high-precision host system timers should be able to provide about 1 microsecond resolution.

My use case is that I am implementing a new algorithm and need to compare with other algorithms to show its efficiency. Since I am in the range of 10x factor a 4% error can be problematic when measuring as it can corrupt the factor. I am writing a scientific paper and things need to be accurate

The other problem is that at the end I have to add up GPU usage and the numbers just come out crazy! I will re-check on this because if it maintains a 4% one can just factor it in.

Is there a way how to ensure clocks are constant on a K40? such that I can factor out any issue related to variable clocks?

Set persistence mode and then set application clocks. Both can be done using the nvidia-smi tool.

nvidia-smi --help

You may need root privilege.

You would want to pick a consistent, reproducible set of numbers. So either use your own measurements (don’t forget to validate and calibrate!) throughout your work, or use the results from nvvp throughout your work.

You would also want to repeat your measurements multiple times to account for deviations from run to run, and then pick a methodology for processing the results, e.g. median, arithmetic average, or fastest time. For example, if I recall correctly, the STREAM benchmark uses a best-of-ten methodology, while the SPEC CPU benchmarks use a median-of-three methodology.

from greg:

https://devtalk.nvidia.com/default/topic/1027845/cuda-programming-and-performance/why-would-code-run-1-7x-faster-when-run-with-nvprof-than-without-/post/5228199/#5228199

profilers measure kernel execution time more accurately than is possible with cuda events.