Number of GPU clock cycles

DaOzZ · August 15, 2013, 1:17pm

hi there!
i work for my institute and we are trying to develop a speedup formula for the cuda system.

i tried a few things to measure the runtime of my code. no problem.

now we think it would be good to know the speedup in terms of the clock cycles used by the operations.

is there a function in cuda that returns the number of clock cycles used on the gpu? or does someone have an idea how to measure that? i have tried some things but they only give coarse estimates…

thanks for reading and thinking about that. if someone has an idea i would be grateful. i am programming in c++

greets
daozz

xiuxiuish · August 15, 2013, 2:24pm

The time measurement is just a count of the clock cycles on your device.
Once you have the time measured, you know how many cycles passed (the time multiplied by the clock frequency).
They are the same concept. I do not know what you are really looking for.

njuffa · August 15, 2013, 4:54pm

I am not sure what you are looking for, but would suggest checking whether any of the metrics or events provided by the nvprof profiler provide the desired numbers (the IPC metric, for example). Use the command line switches --query-events and --query-metrics to have nvprof show the events and metrics available on a particular GPU.

SPWorley · August 15, 2013, 4:55pm

clock64() will give you an exact number of clock ticks elapsed on an SMX. It’s extremely useful for precise benchmarking of kernels.

Microbenchmarking individual instructions is much trickier, since the SPs are pipelined and there’s a long and variable latency for each operation, plus the multiple warps are not deterministically scheduled. So you can’t just use clock64() before and after an instruction to know “how fast” it is.
The usual trick is to make an ultra-simple unrolled loop of just the instruction you are interested in, for example hundreds of sqrt()s one after another. All threads do this. Then you can use clock64() at the start of the block and the end of the block to find the net clocks used by all threads. Divide by the number of sqrt() calls and the number of threads, and you get a good idea of the clock throughput of sqrt(). The large number of threads and the ultra-unrolled loop will amortize the overhead of the loop counter and non-sqrt() opcodes.

DaOzZ · August 15, 2013, 5:14pm

Thanks for your answer.
i forgot to explain exactly… i know the time.h header for c++ with clock() and CLOCKS_PER_SEC. now i need something like this… or could i use these for my gpu? the problem is that i get unrealistic results… maybe i have to consider sth.?

e.g. a very simple, dirty c++ program:

clock_t t1, t2;

t1 = clock();
for (int i = 0; i < 1000; i++) {
    std::cout << "some kind of result: " << (((1+2+2)*123123123)%23)/23.3
              << std::endl;
}
t2 = clock();

std::cout << "t1: " << t1 << " t2: " << t2 << "\nno of cycles (t2-t1): "
          << t2-t1 << std::endl;

output:
[…some result outputs…]
t1: 1 t2: 559
no of cycles (t2-t1): 557

is it possible to use this mechanism for a __global__ method? sth like:

[...]
t1 = clock();
foo<<<dimGrid, dimBlock>>>(a,b,c);
t2 = clock();
[...]

is clock() counting the clock cycles of the cpu ONLY or could i use this for cuda too? i am not sure about this…

greets
daozz

JFSebastian · August 15, 2013, 7:14pm

You cannot directly use CPU timing primitives for timing kernel function executions, since kernel launches are asynchronous. If you use clock(), you will only measure the time to submit the kernel job to the GPU, not the time for completing the execution of the kernel.

You have two possibilities.

The first is to use a barrier like cudaDeviceSynchronize() which halts the execution of the CPU host thread which has issued cudaDeviceSynchronize() until the GPU has finished processing all previously requested cuda tasks (kernels, memory transactions, etc.). As an example:

clock_t begin = clock();
Kernel<<<grid,block>>>(...);
cudaDeviceSynchronize();
clock_t end = clock();
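
For this host-side variant, a wall-clock timer is often a better fit than clock(), which on most platforms reports processor time rather than elapsed time. A minimal sketch using std::chrono::steady_clock follows. The names elapsed_ms and fake_blocking_work are hypothetical; with a real kernel, the callable would contain the kernel launch followed by cudaDeviceSynchronize().

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds. For a CUDA kernel, `work` would launch the kernel
// and then call cudaDeviceSynchronize(), so the timer stops only after the
// GPU has finished.
template <typename Work>
double elapsed_ms(Work&& work) {
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}

// Stand-in for "launch + synchronize" so the sketch runs without a GPU.
inline void fake_blocking_work() {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

Calling elapsed_ms(fake_blocking_work) returns roughly 20 ms or a little more. If the callable returned before the work had finished, as an unsynchronized kernel launch does, the timer would only see the submission cost.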

The second is to use the approach using CUDA events. As an example:

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
 
cudaEventRecord(start, 0);
Kernel<<<grid,block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
 
cudaEventElapsedTime(&time, start, stop);
printf("Time for the kernel: %f ms\n", time);

cudaEventDestroy(start);
cudaEventDestroy(stop);

You can create your own timing classes using CUDA events. The library at http://www.orangeowlsolutions.com/bluebird contains a timing class constructed around this standard approach.

SPWorley · August 15, 2013, 8:21pm

There are two clock() functions and it’s important not to confuse them. One is the standard C/C++ function on the host. It usually has a resolution of 1/60 of a second. This is terrible for any benchmarking.

But the accurate and important clock64() is supported on the device and it is an actual count of clock ticks.

I use it in microbenchmarking by running a single block fully populated with 1024 threads to keep the SMX occupied. Each thread evaluates clock64() at the start of the block and the end. There’s an unrolled loop in the middle calling sqrt() or whatever many many times. I send the clock64() results back to the host.

Unfortunately we have no guarantees about warp scheduling, but it works out OK in practice to take the first (smallest) of all the start clock values as the single “start” and the last (largest) of all the end clock values as the “end”. I take the difference, divide by thread count, divide again by the operation count, and that gives a decent (and usually quite repeatable) idea of the throughput of the operation in clocks.

Yes, there is uncertainty and overhead with the loop variables, and with the start and end clock values, but a few million evals by 1024 threads is enough to amortize those away to subclock noise.
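
The host-side arithmetic described above can be sketched as follows. This is not SPWorley's actual code. The function name clocks_per_op and its parameters are hypothetical, and it assumes the per-thread clock64() samples from a single block have already been copied back to the host.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical post-processing for per-thread clock64() samples from one
// block. starts[i] / stops[i] are the values thread i read before and after
// its unrolled loop; ops_per_thread is how many operations each thread ran.
double clocks_per_op(const std::vector<long long>& starts,
                     const std::vector<long long>& stops,
                     long long ops_per_thread) {
    assert(!starts.empty() && starts.size() == stops.size());
    assert(ops_per_thread > 0);
    // The earliest start and the latest stop bracket the whole block's work.
    const long long first = *std::min_element(starts.begin(), starts.end());
    const long long last  = *std::max_element(stops.begin(), stops.end());
    // Divide by thread count and by operation count in one step.
    const double total_ops =
        static_cast<double>(starts.size()) * static_cast<double>(ops_per_thread);
    return static_cast<double>(last - first) / total_ops;
}
```

With four threads, 256 operations each, an earliest start of 100 and a latest stop of 4196, this gives 4096 / 1024 = 4 clocks per operation. The result is an amortized figure over all threads, not the latency of one instruction.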

There’s a good but dated paper on CUDA microbenchmarking, with code.

Also remember that for basic math operations like sqrt and such, the CUDA C Programming Guide actually gives a table of throughputs for many operations.

SPWorley · August 15, 2013, 11:00pm

Followup… I was mostly talking about microbenchmarking small routines or even single operations.
When you’re benchmarking entire kernels, use JFSebastian’s example with CUDA events. That’s the most precise timing of entire kernel execution time. Avoid using C++'s clock() or other OS timers… they’re really imprecise. The CUDA event timer is also imprecise (compared to clock-level counting on the device itself), but it will have less error than any timer wrapper you can put around CUDA launch code.

DaOzZ · August 16, 2013, 11:08pm

Thanks a lot for your support. i will try your ideas when the time has come! i think i have to study the cuda event mechanism in general… it seems to be a nice thing

vacaloca · August 17, 2013, 2:01pm

SPWorley:

Are you able to post a simple code snippet of your microbenchmarking code? I'm curious to try it out with some of my own operations in my DP code on an overclocked GTX Titan.

mountain-thu · February 17, 2014, 3:38am

Can you explain what “JFSebastian’s example with CUDA events” is?

SPWorley · February 17, 2014, 3:47am

The post above where he gives a code example for timing using CUDA events.

mountain-thu · February 17, 2014, 1:57pm

Another question: I use the code “unsigned int start=clock();” in the kernel and print it out in main(). But it is a negative number (sometimes it is a positive number), and I don’t know why.

I tried to use long int, but it still prints out a negative number.

SPWorley · February 18, 2014, 12:07am

The value of clock() or clock64() isn’t meaningful by itself. You want to use differences of two clock return values… just subtract the start value from the end value. You can still get a possible wraparound problem with clock() but clock64’s large range effectively solves that.
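
A minimal sketch of why differences still work across a wraparound, and why a raw value can print as negative. It assumes the counter behind the device clock() is 32 bits wide; the helper names elapsed_ticks32 and as_signed are hypothetical.

```cpp
#include <cstdint>

// Elapsed ticks between two samples of a 32-bit counter. Unsigned
// subtraction is modulo 2^32, so the result is still correct if the counter
// wrapped around once between the two reads.
inline std::uint32_t elapsed_ticks32(std::uint32_t start, std::uint32_t stop) {
    return static_cast<std::uint32_t>(stop - start);
}

// The same raw bits reinterpreted as a signed number (on the usual
// two's-complement platforms). A large counter value printed through a
// signed type or a signed format specifier shows up as negative.
inline std::int32_t as_signed(std::uint32_t raw) {
    return static_cast<std::int32_t>(raw);
}
```

For example, a start of 0xFFFFFFF0 and a stop of 0x00000010 give an elapsed count of 32 ticks even though the stop value is numerically smaller. Two or more wraparounds between the reads cannot be detected this way, which is the case the 64-bit counter avoids.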

wlangdon · February 19, 2014, 8:23pm

If using the SMX clock, remember that each SMX has its own clock and these are not synchronised.
With some GPUs (e.g. the C2050) the clocks drift apart the more you use the GPU.
With others, a device reset appears to re-sync the clocks (and then they drift apart again).
I have some code which makes a best-effort stab at combining them for a kernel running
across many multiprocessors. Let me know if you want a copy.
Bill
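
One simple way to stay clear of the unsynchronised clocks is to compare samples only with other samples taken on the same multiprocessor. The sketch below is not Bill's code and makes no attempt to align the clocks. It assumes each thread block recorded the ID of the multiprocessor it ran on (for example from the PTX %smid special register) along with its clock64() samples; BlockSample, SmSpan and longest_sm_interval are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// One timing record per thread block (hypothetical layout).
struct BlockSample {
    unsigned sm_id;    // multiprocessor the block ran on
    long long start;   // clock64() at block start
    long long stop;    // clock64() at block end
};

// Earliest start and latest stop seen on one multiprocessor.
struct SmSpan {
    long long first;
    long long last;
};

// Longest busy interval over all multiprocessors, in clock ticks.
// Samples are only ever compared with samples from the same SM.
long long longest_sm_interval(const std::vector<BlockSample>& samples) {
    assert(!samples.empty());
    std::map<unsigned, SmSpan> per_sm;
    for (const BlockSample& s : samples) {
        auto it = per_sm.find(s.sm_id);
        if (it == per_sm.end()) {
            per_sm.emplace(s.sm_id, SmSpan{s.start, s.stop});
        } else {
            it->second.first = std::min(it->second.first, s.start);
            it->second.last  = std::max(it->second.last, s.stop);
        }
    }
    long long longest = 0;
    for (const auto& entry : per_sm) {
        longest = std::max(longest, entry.second.last - entry.second.first);
    }
    return longest;
}
```

The result is only a per-multiprocessor figure. It says nothing about when one SM started relative to another, so it is a rough lower bound on the kernel's duration in ticks rather than an exact total.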

bharat629 · June 16, 2017, 11:49am

I am trying to understand the difference between the two clock() functions, one on the device and one on the host.
Both run on their respective hardware, which has different clock speeds: 893 MHz (GPU) and 1.7 GHz (CPU). Is that why cpuResult and gpuResult come out different? Is my hypothesis correct?

Here is the abstract representation of my code.

__global__ void intKernel(int *input, int *result) {
    unsigned long startClock, stopClock;
    startClock = clock();   // device clock(): per-multiprocessor cycle counter
    // mycode, only registers are used.
    stopClock = clock();
    *result = stopClock - startClock;   // elapsed GPU clock cycles
}

// host code
cpuStartClock = clock();   // host clock(): units of 1/CLOCKS_PER_SEC seconds
intKernel<<<Dg, Db>>>(data, gpuResult);
cudaDeviceSynchronize();
cpuStopClock = clock();

cpuResult = cpuStopClock - cpuStartClock;
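
The two results are not in the same unit, so they cannot be compared directly: the device value is a count of GPU clock cycles, while the host clock() counts in units of 1/CLOCKS_PER_SEC seconds, not CPU cycles. A minimal sketch of converting both to milliseconds follows. The helper names are hypothetical, the GPU clock rate is assumed to be given in kHz (the unit used by cudaDeviceProp::clockRate), and the GPU clock is assumed not to change during the run.

```cpp
#include <ctime>

// Device clock()/clock64() differences are GPU clock cycles. A rate of
// N kHz means N cycles per millisecond, so cycles / kHz gives milliseconds.
inline double gpu_cycles_to_ms(long long cycles, int clock_rate_khz) {
    return static_cast<double>(cycles) / static_cast<double>(clock_rate_khz);
}

// Host clock() differences are ticks of 1/CLOCKS_PER_SEC seconds each.
inline double host_ticks_to_ms(std::clock_t ticks) {
    return 1000.0 * static_cast<double>(ticks) /
           static_cast<double>(CLOCKS_PER_SEC);
}
```

At 893 MHz (893000 kHz), 893000 device cycles correspond to 1 ms. Even after conversion the host figure will normally be larger, since it also covers the kernel launch and the wait in cudaDeviceSynchronize().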
