Profiling inside a kernel

Hey guys

How does one go about profiling the code executed inside of a kernel?

I have a single kernel function that is launched, which in turn calls various static device functions.

The trouble is that the NVIDIA profiler only reports stats for the entire kernel launch. I want the GPU time spent in each of the calls inside the kernel code.

Maybe I can use clock() to help me somehow? I’m not sure how though…

Check out the clock SDK example. It calls clock() inside the kernel and stores the results in device memory. You should be able to do the same sort of thing to time your different device functions: read clock() before and after each call, store the differences in global memory, copy the timings back to the host, and print them out there.
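Here is a rough sketch of that idea, not the SDK example itself. The names stage_a, stage_b, profiled_kernel and run_profile are made up for illustration, and the two device functions are just placeholder work standing in for yours. It uses clock64() (the 64-bit variant of clock()) to avoid counter wraparound, and has thread 0 of each block write the elapsed cycles for each call to global memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the device functions you want to time.
static __device__ float stage_a(float x) {
    for (int i = 0; i < 100; ++i) x = x * 1.0001f + 0.5f;
    return x;
}
static __device__ float stage_b(float x) {
    for (int i = 0; i < 1000; ++i) x = sqrtf(x + 1.0f);
    return x;
}

// timings holds two entries per block: cycles in stage_a, cycles in stage_b.
__global__ void profiled_kernel(const float* in, float* out,
                                long long* timings, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? in[idx] : 0.0f;

    long long t0 = clock64();   // per-SM cycle counter
    v = stage_a(v);
    long long t1 = clock64();
    v = stage_b(v);
    long long t2 = clock64();

    // Writing the result keeps the compiler from discarding the work.
    if (idx < n) out[idx] = v;

    // Only one thread per block records, to keep the output small.
    if (threadIdx.x == 0) {
        timings[2 * blockIdx.x]     = t1 - t0;
        timings[2 * blockIdx.x + 1] = t2 - t1;
    }
}

// Launches the kernel and copies 2 * blocks cycle counts into h_timings.
// Returns false if any CUDA call fails.
bool run_profile(int blocks, int threads, long long* h_timings) {
    int n = blocks * threads;
    std::vector<float> h_in(n, 1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    long long* d_t = nullptr;
    bool ok =
        cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_t, 2 * blocks * sizeof(long long)) == cudaSuccess &&
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        profiled_kernel<<<blocks, threads>>>(d_in, d_out, d_t, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_timings, d_t, 2 * blocks * sizeof(long long),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_t);
    return ok;
}

int main() {
    const int blocks = 4, threads = 64;
    std::vector<long long> t(2 * blocks);
    if (!run_profile(blocks, threads, t.data())) {
        std::printf("CUDA error\n");
        return 1;
    }
    for (int b = 0; b < blocks; ++b)
        std::printf("block %d: stage_a %lld cycles, stage_b %lld cycles\n",
                    b, t[2 * b], t[2 * b + 1]);
    return 0;
}
```

The numbers are in clock cycles as seen by thread 0 of each block, so treat them as rough: other warps on the same multiprocessor are interleaved with the one being timed, and the compiler may move some instructions across the clock64() reads. Dividing by the SM clock rate (for example the value reported by cudaDeviceGetAttribute with cudaDevAttrClockRate, which is in kHz) gives an approximate time.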