Same blocks and threads configuration giving different GPU execution times

I’m running a basic task in a kernel with a launch configuration of blocks(8,1,1) and threads(4,8,16). I’m measuring the GPU time with cudaEvents and the cudaEventElapsedTime function.
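For reference, this is roughly the timing pattern I’m using (a minimal sketch; the kernel body and the data buffer are placeholders, since my actual task isn’t shown here):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the actual task.
__global__ void myKernel(float *data)
{
    // Flatten the 3D thread index within the block, then offset by block.
    int local = threadIdx.x
              + threadIdx.y * blockDim.x
              + threadIdx.z * blockDim.x * blockDim.y;
    int idx = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + local;
    data[idx] += 1.0f;
}

int main()
{
    const int n = 8 * 4 * 8 * 16;   // blocks(8,1,1) * threads(4,8,16)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    dim3 blocks(8, 1, 1);
    dim3 threads(4, 8, 16);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<blocks, threads>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);       // block until the stop event completes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // result is in milliseconds
    printf("GPU time: %f sec\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```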

When I run the program, I get a time of 0.000123 sec. But when I run the same program again and again, I get 0.000213 sec, 0.000146 sec, 0.000170 sec, 0.000202 sec, 0.000299 sec, and so on.

So I’m not sure which value to report as the correct GPU execution time for the task. Which time should be considered?

Thanks in advance

How are you measuring? To get accurate GPU timing numbers, use the NVIDIA profiler.

Thanks for the reply.
I’m using cudaEvents and the cudaEventElapsedTime function to measure the GPU time.
Even with the profiler, the execution time is not the same from run to run.

This difference between runs is just noise; it’s the average that matters. Other things running on the system can (and will) affect the measurement between runs.
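A common way to get a stable number is to do one warm-up launch, then time many launches and report the mean. Here is a minimal sketch along those lines, reusing the launch configuration from the question (the kernel body is a placeholder):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the poster's "basic task".
__global__ void myKernel(float *data)
{
    int local = threadIdx.x
              + threadIdx.y * blockDim.x
              + threadIdx.z * blockDim.x * blockDim.y;
    int idx = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + local;
    data[idx] += 1.0f;
}

int main()
{
    const int n = 8 * 4 * 8 * 16;   // blocks(8,1,1) * threads(4,8,16)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    dim3 blocks(8, 1, 1);
    dim3 threads(4, 8, 16);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch: the first launch absorbs one-time setup costs
    // (context/module initialization) that would skew the measurement.
    myKernel<<<blocks, threads>>>(d_data);
    cudaDeviceSynchronize();

    // Time many launches and report the mean, which smooths out the
    // run-to-run noise caused by other activity on the system.
    const int runs = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < runs; ++i)
        myKernel<<<blocks, threads>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Average GPU time per launch: %f sec\n", (ms / runs) / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```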