When timing a function, clock() varies each time. clock() (at end) - clock() (at start) = 9565 on one run, a slightly different value on the next.

I want to time a fixed function, but each time I execute the kernel I get a slightly different value. I would think (though I am probably wrong) that the time would be the same on every call. I can understand why a CPU varies, because of caching and the CPU time slices given to other processes. However, I would have guessed that each time a GPU kernel is called the state of the GPU is “reset”, and there is no time slicing with other programs, so the result should always be the same. Does anyone know why this might be?

Note: I noticed that this only happens when memory is read…

My sample code…

[font=“Courier New”]#define NUM_BLOCKS 1
#define NUM_THREADS 1

__global__ static void timedReduction(const float *input, float *output, clock_t *timer)
{
    timer[0] = clock();      // start of timed region
    output[0] = input[0];    // one global read + one global write
    timer[1] = clock();      // end of timed region
}[/font]

Output…
(Trial 1 results) Time = 2600, 2232, 2364, 2456, 2392, 2616
(Trial 2 results) Time = 2548, 2376, 2348, 2384, 2428, 2436
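
For reference, this is roughly how the numbers above are collected on the host. A minimal sketch with no error checking, assuming the timedReduction kernel above is in the same .cu file and that single-element buffers are enough:

[font=“Courier New”]#include <cstdio>

// Minimal host-side sketch: launch the kernel once and read back the two clock values.
int main()
{
    float *dInput, *dOutput;
    clock_t *dTimer;
    clock_t hTimer[2];
    float hInput = 1.0f;

    cudaMalloc((void **)&dInput, sizeof(float));
    cudaMalloc((void **)&dOutput, sizeof(float));
    cudaMalloc((void **)&dTimer, 2 * sizeof(clock_t));
    cudaMemcpy(dInput, &hInput, sizeof(float), cudaMemcpyHostToDevice);

    timedReduction<<<NUM_BLOCKS, NUM_THREADS>>>(dInput, dOutput, dTimer);
    cudaMemcpy(hTimer, dTimer, 2 * sizeof(clock_t), cudaMemcpyDeviceToHost);

    printf("Time = %ld\n", (long)(hTimer[1] - hTimer[0]));

    cudaFree(dInput);
    cudaFree(dOutput);
    cudaFree(dTimer);
    return 0;
}[/font]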

Sample code 2 …

[font=“Courier New”]__global__ static void timedReduction(const float *input, float *output, clock_t *timer)
{
    timer[0] = clock();    // start of timed region
    output[0] = 1;         // global write only, no global read
    timer[1] = clock();    // end of timed region
}[/font]

(Trial 1 results) Time = 178, 178, 178, 178, 178, 178
(Trial 2 results) Time = 178, 178, 178, 178, 178, 178

Because it will depend on what DRAM banks are active, etc.

Hi tmurray, thanks for the reply.

When I run the kernel in a loop…
(Trial 1 results) Time = 2600, 2232, 2364, 2456, 2392, 2616
…isn’t this using the same DRAM banks?

I’m guessing the best solution would be to run the sample code several times and average the results (as would be done on a CPU).
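
Something like this sketch, assuming the device buffers and the NUM_BLOCKS / NUM_THREADS defines from the host code above, with timedReduction being the kernel from my first post (the run count is an arbitrary choice):

[font=“Courier New”]// Rough sketch: launch the kernel several times and average the elapsed clocks.
// dInput, dOutput and dTimer are the device buffers from the host code above.
long long averageCycles(const float *dInput, float *dOutput, clock_t *dTimer, int numRuns)
{
    long long total = 0;
    clock_t hTimer[2];

    for (int run = 0; run < numRuns; ++run)
    {
        timedReduction<<<NUM_BLOCKS, NUM_THREADS>>>(dInput, dOutput, dTimer);
        cudaMemcpy(hTimer, dTimer, 2 * sizeof(clock_t), cudaMemcpyDeviceToHost);
        total += (long long)(hTimer[1] - hTimer[0]);
    }

    return total / numRuns;    // average cycles per launch
}[/font]

That smooths out the run-to-run variation, though it is still an average rather than an exact, repeatable number.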

FYI, for anyone interested in consistent, exact timing: by consistent, exact timing I mean that each time the code runs it will always give you the same timing for a given chunk of code. I have not tested other video cards, but my guess is that each GPU generation (or maybe even each model of card) will have different timings.

From my playing around it seems like exact, consistent timing is possible when…

- The number of threads is 1 (I’m guessing this holds up to 32, i.e. one warp).
- The timer is not running while there are global memory reads.
- Global writes seem to be okay.
- Either all global memory is copied into shared memory or registers first and only then is the timer started, or the timer is paused while reading from global memory (see the sketch after this list).
- I did not test read-only (cached) memory, but I’m guessing the timer cannot be running during the first read.
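
For example, something along these lines (my own rough sketch, not tested on every card): the global read is done before the timer starts, so the timed region only touches shared memory, registers, and a global write.

[font=“Courier New”]__global__ static void timedStaged(const float *input, float *output, clock_t *timer)
{
    __shared__ float staged;

    staged = input[0];            // global read happens before the timed region
    __syncthreads();

    timer[0] = clock();           // timer starts only after the data is staged
    output[0] = staged * 2.0f;    // shared-memory read + global write inside the timed region
    timer[1] = clock();
}[/font]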

Global memory reads seem to take a slightly different number of cycles each time they happen; they appear a little bit random. There is no consistency even when I time the read in a loop inside the kernel, when I time the kernel from call to call, or when I completely restart the host application. Global memory reads are simply not consistent.
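
For what it’s worth, the in-kernel loop I mean is something like this sketch (the iteration count and input array length are arbitrary, and timer needs room for two values per iteration):

[font=“Courier New”]#define ITERATIONS 16

// Rough sketch: time a global read on every loop iteration inside one launch.
// Even within a single launch the per-iteration deltas come out slightly different.
__global__ static void timedReads(const float *input, float *output, clock_t *timer)
{
    float sum = 0.0f;

    for (int i = 0; i < ITERATIONS; ++i)
    {
        timer[2 * i]     = clock();
        sum += input[i];              // global read inside the timed region
        timer[2 * i + 1] = clock();
    }

    output[0] = sum;
}[/font]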

Note: if anyone is playing around with this, make sure that device emulation is not turned on, or else you will be getting CPU timings.