clock() function on Titan X

The clock() function on a Titan X is not working as I expect. The code below issues two back-to-back clock() calls to capture latency. When I run the kernel, start_time is ahead of stop_time, which should not happen. When I added a printf after the clock() calls, the code worked fine. I am not sure whether this is a known issue or whether it is caused by a very small kernel exiting too soon.

Code not working:

unsigned int start_time = 0, stop_time = 0;
start_time = clock();
stop_time = clock();

Code working:

unsigned int start_time = 0, stop_time = 0;

start_time = clock();
stop_time = clock();

printf("Clock %d %d\n", start_time, stop_time);

Please let me know if anyone has faced a similar issue.

A few possibilities:

clock() can roll over.

You may also be experiencing a compiler optimization/reordering effect.

Finally, it's possible in some circumstances (dynamic parallelism) for the clock function to give spurious results, because threadblocks can be rescheduled from one SM to another.
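On the rollover point specifically: with the 32-bit counter, unsigned subtraction still gives the correct elapsed count even across a wrap, as long as the interval is under 2^32 cycles. A minimal sketch (hypothetical kernel name, untested):

```cuda
// Unsigned wraparound arithmetic makes rollover harmless for intervals
// shorter than 2^32 cycles: if the counter wraps between the two reads,
// (stop - start) is still the right answer modulo 2^32.
__global__ void elapsed_cycles(unsigned int *out)
{
    unsigned int start = clock();
    // ... timed work would go here ...
    unsigned int stop = clock();
    out[0] = stop - start;   // well-defined even if stop < start
}
```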

Thanks txbob. The kernel configuration is just a single thread, so I don't think rescheduling or reordering matters; I checked the PTX and the assembly, and the order of instructions looks correct. I also ran this on the Fermi and Kepler architectures, and the same code works fine there without the printf.

Dynamic parallelism: I thought about it, but I checked the assembly and there is no function call, so it can't be dynamic parallelism. It's just a register move operation to read clock().

Clock rollover:
I tried with 'long long int' and the clock64() function. Even this gave me similar results; when I add the printf, the times look good.
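For reference, the 64-bit variant described above would look something like this (kernel name is illustrative); note it is still exposed to the same reordering issue as the 32-bit version:

```cuda
__global__ void kclock64(long long *ts)
{
    long long start_time = clock64();   // 64-bit counter: no realistic rollover
    long long stop_time  = clock64();

    // Without a data dependency between the two reads, the compiler is
    // still free to reorder them, which matches the observed behavior.
    ts[0] = start_time;
    ts[1] = stop_time;
}
```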

Please let me know if you think something else could be causing the problem.

just one possible trick:

start = clock();
x = computation();
if (x!=42) stop = clock();

BulatZiganshin, this is exactly why I was using clock(): to estimate the latency of some computation. In my case there are 256 dependent add instructions, but the measured latency for these 256 instructions is just 8 cycles, which is clearly not what I expect. I tried printing the start and stop times, but that doesn't help this time; it still reports 8 cycles.

Or maybe you’ve just made a mistake in your code in the “failing” case. It’s possible that the problem is outside of the code you have shown. You might get better responses if you provide a short, complete code that demonstrates the issue.

__global__ void kclock(unsigned int *ts)
{
    unsigned int start_time = 0, stop_time = 0;

    start_time = clock();
    stop_time = clock();

    //printf("Clock %d %d\n", start_time, stop_time);

    ts[(blockIdx.x*blockDim.x + threadIdx.x)*2] = start_time;
    ts[(blockIdx.x*blockDim.x + threadIdx.x)*2+1] = stop_time;
}

This is the kernel. I also added the kernel call below.

kclock <<<Dg, Db>>> (d_ts);
cudaThreadSynchronize();
cudaMemcpy(ts, d_ts, Dg*Db*2*sizeof(unsigned int), cudaMemcpyDeviceToHost);  // note: sizeof(ts) would only give the size of the pointer

The values printed from the cudaMemcpy data are not correct.

"if (x!=42)" is the key part; it ensures that the second assignment is performed after the computation of x.

It's due to instruction reordering by the compiler. If you think it is not, please present your analysis of the SASS code.

If you want further help from me, you’ll need to provide a complete example:

  1. A complete code that I can copy, paste, compile, and run, without having to add anything or change anything.
  2. Your actual output data.
  3. A description of your platform: GPU, OS, CUDA version, compile command

Here is my test case, on CUDA 7.5, Fedora 20, GTX960, borrowing the technique proposed by BulatZiganshin:

$ cat t952.cu
#include <stdio.h>


__global__ void kclock(unsigned int *ts)
{
  unsigned int start_time = 0, stop_time = 0;

  start_time = clock();
#ifdef FIX
  if (start_time != 42)
#endif
  stop_time = clock();

    //printf("Clock %d %d\n", start_time, stop_time);

  ts[(blockIdx.x*blockDim.x + threadIdx.x)*2] = start_time;
  ts[(blockIdx.x*blockDim.x + threadIdx.x)*2+1] = stop_time;
}

int main(){

  int Dg = 1, Db = 1;
  unsigned int *ts, *d_ts;
  ts = (unsigned int *)malloc(Dg*Db*2*sizeof(unsigned int));
  cudaMalloc(&d_ts, Dg*Db*2*sizeof(unsigned int));
  kclock <<<Dg, Db>>> (d_ts);
  cudaDeviceSynchronize();
  cudaMemcpy(ts, d_ts, Dg*Db*2*sizeof(unsigned int), cudaMemcpyDeviceToHost);
  printf("%u, %u\n", ts[0], ts[1]);
  return 0;
}
$ nvcc -arch=sm_52 -o t952 t952.cu
$ ./t952
419170408, 419169995
$ nvcc -DFIX -arch=sm_52 -o t952 t952.cu
$ ./t952
420292229, 420292254
$

With the code modification, the second value is greater than the first.

I am sorry txbob for not providing all the code.
I will try the “if (x!=42)” part BulatZiganshin.

Thanks

  1. Even better, you should include a dependency on both ends of the chain:
start = clock()
if (start!=42) x = 1
x += y
if (x!=42) stop = clock()

This ensures that the first clock() is called before the additions and the second one after the additions. But even in this sequence, the additions may be reordered among themselves due to the laws of arithmetic.

  2. The optimizer can replace 256 additions with a single multiplication. You need to take special care to prohibit that.
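Putting both points together, a guarded version of the 256-add measurement might look like the sketch below (the kernel name, guard constants, and alternating-add pattern are illustrative, not tested). The alternating a += b; b += a; forms one long dependency chain that the optimizer cannot fold into a single multiply:

```cuda
__global__ void add_chain_latency(unsigned int *ts, int *result)
{
    int a = threadIdx.x + 1;          // runtime values defeat constant folding
    int b = blockIdx.x + 2;

    unsigned int start = clock();
    if (start != 42) a += 1;          // ties the chain to the first read
#pragma unroll
    for (int i = 0; i < 128; i++) {   // 2 adds per iteration = 256 dependent adds
        a += b;
        b += a;
    }
    unsigned int stop = 0;
    if (b != 42) stop = clock();      // second read depends on the chain result

    ts[0] = start;
    ts[1] = stop;
    *result = b;                      // store the result so the chain is not dead code
}
```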

Overall, correct and meaningful benchmarking is a knowledge area of its own. It's full of traps… :(

At least start by providing all the info listed by txbob, and try to look at the generated SASS yourself.

Adding the condition worked.
Thanks, BulatZiganshin and txbob.

Just a later reply to say thanks; this thread helped me get on the right track.

In my case, I observed the following. I had:

start clock
big body of code (sprinkled with middle clock timings)
end clock

I observed timings that didn't make much sense: all the clocks were within a few ticks of each other. So I compiled it to a cubin and used the SASS dump flag to look at the assembly, and the code order matched the source. I was baffled. Finally, after compiling the cubin for compute capability 5.2, I observed different assembly; it was doing this:

big body of code
sprinkled middle clock timing attempts, with start clock mixed in there
end clock

It seems that, due to a lack of dependencies, nvcc just reordered the non-dependent code and threw all the clocks at the bottom.

I had to force a dependency at the top and the bottom, as BulatZiganshin suggested. My dependent computation was j, so I had:

if (j)
start clock
big body of code computing j, sprinkled with (if j) run middle clocks
if (j)
end clock

The assembly showed it finally kept the clocks from being reordered. A downside of all this is that it's harder to measure a single instruction; in my case I'm measuring 1024 instructions, so the overhead of the timing is minimal. I suppose the only way around this is compiling in debug mode, which usually keeps the existing code order? Anyway, I'm back on track, and I don't think I would have gotten here without this thread's help. Thanks again!

An even simpler way is to add

__syncthreads() after the start clock()
__syncthreads() before the stop clock()

This should make sure the clock() calls are not rearranged.
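A sketch of that arrangement (placeholder work, untested); note that __syncthreads() itself costs some cycles, so this matters less when the timed region is long:

```cuda
__global__ void timed_region(unsigned int *ts, int *out)
{
    int x = threadIdx.x;

    unsigned int start = clock();
    __syncthreads();                  // barrier: work should not be hoisted above this
    x = x * 3 + 7;                    // region under test (placeholder)
    __syncthreads();                  // barrier: nor sunk below this
    unsigned int stop = clock();

    ts[0] = start;
    ts[1] = stop;
    *out = x;                         // keep the work live
}
```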

What do you want to measure with a single instruction? Latency? Throughput? The GPU executes instructions the way a conveyor builds cars, so before measuring "car assembly time" you need to decide what exactly you are trying to measure.
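To illustrate the distinction: a single dependency chain measures latency, while interleaved independent chains measure throughput, because the pipeline overlaps them. A rough, untested sketch (kernel name and counts are illustrative):

```cuda
__global__ void lat_vs_thru(unsigned int *ts, int *out)
{
    int a = threadIdx.x + 1, b = a + 1, c = b + 1, d = c + 1;

    unsigned int t0 = clock();
    if (t0 != 42) a += 1;
    for (int i = 0; i < 128; i++) {   // one chain, 256 dependent adds:
        a += b; b += a;               // bounded by add latency
    }
    unsigned int t1 = 0;
    if (b != 42) t1 = clock();
    for (int i = 0; i < 64; i++) {    // two independent chains, 256 adds total:
        a += b; b += a;               // the pipeline can overlap the two chains,
        c += d; d += c;               // so this should take fewer cycles
    }
    unsigned int t2 = 0;
    if ((b + d) != 42) t2 = clock();

    ts[0] = t1 - t0;                  // cycles for the single chain
    ts[1] = t2 - t1;                  // cycles for the interleaved chains
    *out = a + b + c + d;             // keep everything live
}
```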