Timing inside the kernel: How to measure times inside the kernel?

Hello there,

I am looking for a way to measure device memory access time, and the time taken by some parts of my kernel code in general. I have found some ways to do it, such as creating CPU or GPU timers and trimming my kernel down to only the parts that I want to measure. However, there should be a more precise method that works inside the kernel.

Any ideas about that?

Thank you very much.


Maybe the CUDA Visual Profiler or Nexus could give you some help there?

I do not have experience with Nexus because I use Linux, but with the CUDA Visual Profiler you can get the estimated throughput, which is not bad. However, I am looking for the most accurate method.

Anyway, maybe Nexus can help.

Yeah, anyway, I haven't had time to play much with Nexus yet, but it seems to be full of nice toys! :)

You can use the clock() method, which returns clock_t. It uses the internal cycle counter on the chip to measure times inside the kernel. Check out page 114 of the CUDA 2.3 programming guide (page 122 of the PDF).

Oops, I didn't read the question properly.

Well, if you want to measure timing results inside the kernel without the profiler (e.g. at runtime, on a customer’s machine) the clock() method is the only way to go.
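For what it's worth, here is a minimal sketch of the clock() approach, not anyone's actual code from this thread. The kernel name timedLoad, the helper elapsedCycles and the buffer names are made up for illustration. clock() reads a per-multiprocessor cycle counter, so the difference between two reads is the number of cycles that passed while the thread was in the timed region, including any cycles it spent waiting while other threads ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Cycles elapsed between two readings of the counter, taken as 32-bit values.
// Unsigned subtraction stays correct if the counter wrapped around once.
__host__ __device__ inline unsigned int elapsedCycles(unsigned int start,
                                                      unsigned int stop)
{
    return stop - start;
}

// Hypothetical kernel: each thread times one load from global memory.
__global__ void timedLoad(const float *in, float *out,
                          unsigned int *cycles, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    unsigned int start = (unsigned int)clock();
    float v = in[tid];       // region being timed
    out[tid] = v + 1.0f;     // uses v, so the load has to complete first
    unsigned int stop = (unsigned int)clock();

    cycles[tid] = elapsedCycles(start, stop);
}

int main()
{
    const int n = 256;
    float *in, *out;
    unsigned int *cycles;
    cudaMalloc((void **)&in, n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));
    cudaMalloc((void **)&cycles, n * sizeof(unsigned int));
    cudaMemset(in, 0, n * sizeof(float));

    timedLoad<<<1, n>>>(in, out, cycles, n);

    // The copy waits for the kernel to finish before reading the results.
    unsigned int h_cycles[n];
    if (cudaMemcpy(h_cycles, cycles, sizeof(h_cycles),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        printf("kernel launch or copy failed\n");
        return 1;
    }
    printf("thread 0 measured %u cycles\n", h_cycles[0]);

    cudaFree(in);
    cudaFree(out);
    cudaFree(cycles);
    return 0;
}
```

The result is in cycles of the multiprocessor clock, so converting it to seconds means dividing by the clock rate of the device (for example the clockRate field from cudaGetDeviceProperties, which is given in kHz).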

I have the same type of problem. How do you make sure that

is not changed by the compiler to:

inside the kernel?


Hello all,

I have used the clock function inside the kernel and it works quite well. The way I do it is




and, checking the PTX code, it looks like the compiler doesn't modify this order; the results also make sense.

You need to be a little cautious in assuming that the PTX code reflects reality. PTX is only an intermediate representation of what really happens on the GPU; there is still scope for further optimization and reordering of instructions at runtime. I am fairly sure that someone (probably Sylvain Collange) has previously demonstrated that runtime reordering of clock() calls can happen in the final instruction stream that hits the silicon.
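Given that caveat, one way to make a measurement less sensitive to exactly where the two clock() reads end up is to time a long chain of dependent loads and divide by the number of loads. The following is only a sketch under assumptions: the names chase and makeChain are hypothetical, the sizes are arbitrary, and nothing here guarantees that the compiler leaves the clock() reads where they are written, so checking the generated machine code is still worthwhile.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Host helper (hypothetical): element i holds the index to visit next.
std::vector<unsigned int> makeChain(unsigned int n, unsigned int stride)
{
    std::vector<unsigned int> next(n);
    for (unsigned int i = 0; i < n; ++i)
        next[i] = (i + stride) % n;
    return next;
}

// Each load needs the result of the previous one, so the loads cannot
// overlap, and writing idx out keeps the compiler from removing them.
__global__ void chase(const unsigned int *next, int iters,
                      unsigned int *sink, unsigned int *cycles)
{
    unsigned int idx = 0;
    unsigned int start = (unsigned int)clock();
    for (int i = 0; i < iters; ++i)
        idx = next[idx];
    *sink = idx;             // needs the final idx before the stop reading
    unsigned int stop = (unsigned int)clock();
    *cycles = stop - start;  // unsigned, so one counter wrap is harmless
}

int main()
{
    const unsigned int n = 1u << 20;   // 4 MB of indices (arbitrary size)
    const int iters = 4096;
    std::vector<unsigned int> h_next = makeChain(n, 1024);

    unsigned int *next, *sink, *cycles;
    cudaMalloc((void **)&next, n * sizeof(unsigned int));
    cudaMalloc((void **)&sink, sizeof(unsigned int));
    cudaMalloc((void **)&cycles, sizeof(unsigned int));
    cudaMemcpy(next, &h_next[0], n * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(next, iters, sink, cycles);   // a single thread

    unsigned int total = 0;
    if (cudaMemcpy(&total, cycles, sizeof(total),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        printf("kernel launch or copy failed\n");
        return 1;
    }
    printf("about %.1f cycles per dependent load\n", (double)total / iters);

    cudaFree(next);
    cudaFree(sink);
    cudaFree(cycles);
    return 0;
}
```

The figure includes loop overhead and one store, and on devices that cache global memory the stride and array size will change what is actually being measured.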

Sure, you're right, the PTX is only an approximation of the final code. However, I have checked the results and they make sense, and I have also checked the decuda output, which is closer to the final code than the PTX is.

Anyway, can you tell me the paper where that is demonstrated?

Thank you very much