Profiling __device__ functions

Is there any way, or any workaround, to profile __device__ functions in the Visual Profiler?

Thanks in advance!

I’m not aware of a way to truly profile it. What I usually do is call the function twice (making sure neither call is optimized away as redundant) and look at how the kernel’s overall profile changes.
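A minimal sketch of that call-twice comparison follows. The names work, kernel and time_kernel_ms are hypothetical, and work() is only a placeholder for the real device function. CUDA events are used here just so the sketch produces a number on its own; the same comparison can be made by reading the two kernel durations in the profiler.

```cuda
#include <cuda_runtime.h>

// Hypothetical device function standing in for the code being measured.
__device__ float work(float x)
{
    for (int i = 0; i < 64; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// CALLS is the number of times work() runs per thread.
// Each call consumes the previous result and the final value is stored,
// so the compiler cannot drop the extra call as dead code.
template <int CALLS>
__global__ void kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = static_cast<float>(i);
    #pragma unroll
    for (int c = 0; c < CALLS; ++c)
        v = work(v);
    out[i] = v;
}

// Time a single launch with CUDA events and return milliseconds.
template <int CALLS>
float time_kernel_ms(int n)
{
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + 255) / 256;
    kernel<CALLS><<<blocks, 256>>>(d_out, n);   // warm-up launch, not timed

    cudaEventRecord(start);
    kernel<CALLS><<<blocks, 256>>>(d_out, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}
```

Calling time_kernel_ms<1>(n) and time_kernel_ms<2>(n) from host code and subtracting gives a rough estimate of what one call costs. If the kernel is bound by memory traffic rather than arithmetic, the difference may be close to zero, which is itself useful information.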

Profiling device functions is also not well-defined because the compiler tries to inline all device functions when possible. The body of the device function is then mixed right into the caller and optimized along with the rest of the code, so there is no clear “start” and “end” of the device function code.
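One way to get a rough boundary back, not mentioned above and shown only as a sketch, is to mark the function __noinline__ and read the clock64() cycle counter before and after the call. The names work, timed_kernel and measure_cycles are hypothetical. The compiler may still schedule instructions across the counter reads, and forcing a real call changes the generated code, so the number is an estimate and not a true profile.

```cuda
#include <cuda_runtime.h>

// Hypothetical device function. __noinline__ asks the compiler to keep it
// as a separate call instead of merging its body into the caller.
__device__ __noinline__ float work(float x)
{
    for (int i = 0; i < 64; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// Each thread records how many clock cycles elapsed around its call.
__global__ void timed_kernel(float *out, long long *cycles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    long long t0 = clock64();
    float v = work(static_cast<float>(i));
    long long t1 = clock64();

    out[i] = v;              // store the result so the call is not removed
    cycles[i] = t1 - t0;
}

// Launch the kernel and return the cycle count measured by thread 0.
long long measure_cycles(int n)
{
    float *d_out = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_cycles, n * sizeof(long long));

    timed_kernel<<<(n + 255) / 256, 256>>>(d_out, d_cycles, n);

    long long first = 0;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&first, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return first;
}
```

The count includes time the warp spent waiting while other warps ran, so it is better read as elapsed cycles than as cycles of work.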

Ailleur’s method is what I use as well to get a sense of how device code contributes to the overall run time.

I can’t speak for Visual Studio, but try the CUDA 6.0 RC NVVP on Linux. The best (and least advertised) new feature in CUDA 6.0 is that the profiler can show you line-by-line information on how much time is spent AND how divergent each line is.

“the profiler can show you line-by-line information”

You can do that too with Nsight (both Eclipse and Visual Studio editions). It can display the number (dynamic, not static) of warp instructions executed for each line of code. This is pretty useful, but I still need to look at the assembly code when something takes much longer than expected. Also, I couldn’t figure out whether there is a way to organize the profile by function instead of just line by line. You could easily miss an important function if you only look at one file.

Where is this feature? The profiler looks pretty much the same to me.