I am looking for a way to measure device memory accesses and, more generally, specific parts of my kernel code. I have found some ways to do it, such as creating CPU timers or GPU timers and trimming my kernel down to just the parts I want to measure. However, there should be a more precise method that works inside the kernel itself.
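For reference, this is roughly what I mean by the GPU-timer approach, as a minimal sketch using CUDA events; myKernel, its launch configuration and the buffer size are just placeholders, not my real code:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel I actually want to time.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // stand-in for the real work
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<n / 256, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

This only gives me the time for the whole launch, which is why I want something finer-grained inside the kernel.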
I do not have experience with Nexus because I use Linux, but with the CUDA Visual Profiler you can get an estimated throughput, which is not bad. Still, I am looking for the most accurate option.
You can use the clock() function, which returns a clock_t. It reads the internal cycle counter on the chip, so you can measure time intervals inside the kernel. Check out page 114 of the CUDA 2.3 programming guide (page 122 of the PDF).
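Here is a rough sketch of how clock() can be used, along the lines of the "clock" sample in the CUDA SDK; timedKernel and the arithmetic inside it are placeholders of mine, and each block writes its start and end cycle counts to global memory:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void timedKernel(const float *in, float *out, clock_t *timer)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0)
        timer[blockIdx.x] = clock();              // cycle counter at start of block

    out[i] = in[i] * in[i];                       // the region being measured

    __syncthreads();                              // wait for the whole block to finish
    if (threadIdx.x == 0)
        timer[blockIdx.x + gridDim.x] = clock();  // cycle counter at end of block
}

int main()
{
    const int blocks = 64, threads = 256, n = blocks * threads;
    float *d_in, *d_out;
    clock_t *d_timer, h_timer[2 * blocks];

    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_timer, 2 * blocks * sizeof(clock_t));

    timedKernel<<<blocks, threads>>>(d_in, d_out, d_timer);
    cudaMemcpy(h_timer, d_timer, 2 * blocks * sizeof(clock_t),
               cudaMemcpyDeviceToHost);

    // cycles spent by block 0; divide by the shader clock rate to get seconds
    printf("block 0: %ld cycles\n", (long)(h_timer[blocks] - h_timer[0]));

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_timer);
    return 0;
}

The counter is per multiprocessor, so only differences taken within the same block are meaningful, and you have to divide by the shader clock rate to convert cycles to wall-clock time.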
Well, if you want to collect timing results inside the kernel without the profiler (e.g. at runtime, on a customer's machine), clock() is the only way to go.
You need to be a little cautious about assuming that the PTX code reflects reality. PTX is only an intermediate representation of what really happens on the GPU; there is still scope for further optimization and reordering of instructions at runtime. I am fairly sure that someone (probably Sylvain Collange) has previously demonstrated that runtime reordering of clock() calls can happen in the final instruction stream that hits the silicon.
Sure! You are right that the PTX is only an approximation of the final code. However, I have checked the results and they make sense, and I have also checked the decuda output, which is closer to the final code than the PTX is.
Anyway, can you point me to the paper where that is demonstrated?