Number of GPU clock cycles

Hi there!
I work at an institute where we are trying to develop a speedup formula for CUDA.

I have tried a few ways of measuring the runtime of my code, and that works without problems.

Now we think it would be useful to express the speedup in terms of the clock cycles the operations use.

Is there a function in CUDA that returns the number of clock cycles used on the GPU? Or does someone have an idea how to measure that? I have tried a few things, but they only give coarse estimates…

Thanks for reading and thinking about this. If someone has an idea I would be grateful. I am programming in C++.

greets
daozz

A time measurement is just a count of your device's clock cycles.
Once you have measured the time, you can work out how many cycles passed from the clock frequency.
They are the same concept. I do not know what you are really looking for.

I am not sure what you are looking for, but would suggest checking whether any of the metrics or events provided by the nvprof profiler provide the desired numbers (the IPC metric, for example). Use the command line switches --query-events and --query-metrics to have nvprof show the events and metrics available on a particular GPU.

clock64() will give you an exact number of clock ticks elapsed on an SMX. It’s extremely useful for precise benchmarking of kernels.

Microbenchmarking individual instructions is much trickier, since the SPs are pipelined and there’s a long and variable latency for each operation, plus the multiple warps are not deterministically scheduled. So you can’t just use clock64() before and after an instruction to know “how fast” it is.
The usual trick is to make an ultra simple unrolled loop of just the instruction you want to measure, for example hundreds of sqrt()s one after another. All threads do this. Then you can use clock64() at the start of the block and the end of the block to find the net clocks used by all threads. Divide by the number of sqrt() calls and the number of threads, and you get a good idea of the clock throughput of sqrt(). The large number of threads and ultra-unrolled loop will amortize the overhead of the loop counter and non-sqrt() opcodes.

Thanks for your answer.
I forgot to explain exactly… I know the time.h header for C++, including clock() and CLOCKS_PER_SEC. Now I need something like this… or could I use these things for my GPU? The problem is that I get unrealistic results… maybe I have to take something into account?

E.g. a very simple, quick-and-dirty C++ program:


#include <ctime>
#include <iostream>

int main() {
    clock_t t1, t2;

    t1 = clock();
    for (int i = 0; i < 1000; i++) {
        std::cout << "some kind of result: " << (((1+2+2)*123123123)%23)/23.3
                  << std::endl;
    }
    t2 = clock();

    // print both samples and their difference in clock() ticks
    std::cout << "t1: " << t1 << " t2: " << t2 << "\nno of cycles (t2-t1): "
              << t2-t1 << std::endl;
    return 0;
}

output:
[…some result outputs…]
t1: 1 t2: 559
no of cycles (t2-t1): 557

Is it possible to use this mechanism for a __global__ function? Something like:


[...]
t1 = clock();
foo<<<dimGrid, dimBlock>>>(a,b,c);
t2 = clock();
[...]

Is clock() counting the clock cycles of the CPU ONLY, or could I use this for CUDA too? I am not sure about this…

greets
daozz

You cannot directly use CPU timing primitives for timing kernel function executions, since kernel launches are asynchronous. If you use clock(), you will only measure the time to submit the kernel job to the GPU, not the time for completing the execution of the kernel.

You have two possibilities.

The first is to use a barrier such as cudaDeviceSynchronize(), which blocks the calling host thread until the GPU has finished all previously issued CUDA work (kernels, memory transfers, etc.). As an example:

clock_t begin = clock();
Kernel<<<grid,block>>>(...);
cudaDeviceSynchronize();
clock_t end = clock();

The second is to use CUDA events. As an example:

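cudaEvent_t start, stop;
float time;  // elapsed time in milliseconds
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);   // record in stream 0, before the kernel
Kernel<<<grid,block>>>(...);
cudaEventRecord(stop, 0);    // record in stream 0, after the kernel
cudaEventSynchronize(stop);  // wait until the stop event has completed

cudaEventElapsedTime(&time, start, stop);
printf("Time for the kernel: %f ms\n", time);

cudaEventDestroy(start);
cudaEventDestroy(stop);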

You can create your own timing classes using CUDA events. The library at http://www.orangeowlsolutions.com/bluebird contains a timing class constructed around this standard approach.

There are two clock() functions and it’s important not to confuse them. One is the standard C/C++ function on the host. Its resolution is implementation-dependent and often coarse (1/60 of a second on some systems), which is terrible for any benchmarking.

The other is on the device: clock(), and its 64-bit variant clock64(), return an actual count of clock ticks, and that is the accurate one that matters here.

I use it in microbenchmarking by running a single block fully populated with 1024 threads to keep the SMX occupied. Each thread evaluates clock64() at the start of the block and the end. There’s an unrolled loop in the middle calling sqrt() or whatever many many times. I send the clock64() results back to the host.

Unfortunately we have no guarantees about warp scheduling, but it works out OK in practice to take the first (smallest) of all the start clock values as the single “start” and the last (largest) of all the end clock values as the “end”. I take the difference, divide by the thread count, divide again by the operation count, and that gives a decent (and usually quite repeatable) idea of the throughput of the operation in clocks.

Yes, there is uncertainty and overhead with the loop variables, and with the start and end clock values, but a few million evals by 1024 threads is enough to amortize those away to subclock noise.

There’s a good but dated paper on CUDA microbenchmarking, with code.

Also remember that for basic math operations like sqrt and such, the CUDA C Programming Guide actually gives a table of throughputs for many operations.

Followup… I was mostly talking about microbenchmarking small routines or even single operations.
When you’re benchmarking entire kernels, use JFSebatian’s example with CUDA Events. That’s the most precise timing of entire kernel execution time. Avoid using C++'s clock or other OS timers… they’re really imprecise. The CUDA event timer is also imprecise (compared to clock-level counting on the device itself), but it will have less error than any timer wrapper you can put around CUDA launch code.

Thanks a lot for your support. I will try your ideas when the time comes! I think I have to study the CUDA event mechanism in general… it seems to be a nice thing.

Are you able to post a simple code snippet of your microbenchmarking code? I’m curious to try it out with some of my own operations in my DP code on an overclocked GTX Titan.

Can you explain what “JFSebatian’s example with CUDA Events” is?

The post above where he gives a code example for timing using CUDA events.

Another question: I use the code “unsigned int start=clock();” in the kernel and print the value in main(). But it comes out as a negative number (sometimes it is positive), and I don’t know why.

I tried using long int, but it still prints a negative number.

The value of clock() or clock64() isn’t meaningful by itself. You want to use differences of two clock return values… just subtract the start value from the end value. You can still get a possible wraparound problem with clock() but clock64’s large range effectively solves that.

If you use the SMX clocks, remember that each SMX has its own clock and these are not synchronised.
With some GPUs (e.g. C2050) the clocks drift apart the more you use the GPU.
With others, a device reset appears to re-sync the clocks (and then they drift apart again).
I have some code which makes a best-effort stab at combining them for a kernel running across many multiprocessors. Let me know if you want a copy.
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

I am trying to understand the difference between the two clock() functions, one on the device and one on the host.
If each clock runs on its own hardware, and the two have different clock speeds, 893 MHz (GPU) and 1.7 GHz (CPU), is that why cpuResult and gpuResult come out different? Is my hypothesis correct?

Here is the abstract representation of my code.

__global__ void intKernel(int * input, int * result) {
    unsigned long startClock, stopClock;
    startClock = clock();
    // my code; only registers are used
    stopClock = clock();
    *result = stopClock - startClock;
}

cpuStartClock = clock();
intKernel<<<Dg, Db>>>(data, gpuResult);
cudaDeviceSynchronize();
cpuStopClock = clock();

cpuResult = cpuStopClock - cpuStartClock;