NVTX kernel code

Hey, I have searched for a solution to this but can’t find a definitive answer.

It seems as if NVTX is able to annotate CUDA behaviour and provide the necessary runtime statistics, but only for API calls. Am I missing something? Can the kernels be annotated in this manner? And if so, does anybody have a resource that shows examples of how to do this?

What I am trying to do and why:

I have a looping sequence of kernels that refines an answer until a “desired threshold” has been met. The ideal number of times the sequence of kernels gets called depends on the data.

Originally, I had the CPU just loop on the sequence, and the kernels themselves would early out if there was no work to be done. The problem was that the CPU would issue, say, 100 loop iterations to the GPU, but if the data found an answer after 3 iterations, the stream would still have to run through the remaining 97 iterations. The kernels would early out, but they would still be issued and executed. E.g., for kernel sequence [A->B->C], the CPU would issue:
[A0->B0->C0]->[A1->B1->C1]-> … -> [A99->B99->C99]
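A minimal sketch of that host-driven version, assuming single-thread placeholder kernels: `kernelA`, `kernelB`, `kernelC`, the `done` flag and the halving "refinement" are all hypothetical stand-ins, not the actual code from this post. It only illustrates that every iteration is still launched after the threshold has been met.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for A, B and C. Each returns immediately once
// the shared "done" flag has been set by an earlier iteration.
__global__ void kernelA(float* value, const int* done) {
    if (*done) return;                 // early out: nothing left to refine
    *value *= 0.5f;                    // placeholder refinement step
}

__global__ void kernelB(float* value, const int* done) {
    if (*done) return;
    *value = fmaxf(*value, 0.0f);      // placeholder clean-up step
}

__global__ void kernelC(const float* value, int* done, int* iters, float threshold) {
    if (*done) return;
    *iters += 1;                       // count iterations that did real work
    if (*value < threshold) *done = 1; // "desired threshold" reached
}

// Issues maxIters iterations unconditionally and returns how many of them
// actually did work. All launches go to the default stream, so they run in order.
int runHostLoop(float start, float threshold, int maxIters) {
    float* value = nullptr;
    int* done = nullptr;
    int* iters = nullptr;
    cudaMalloc(&value, sizeof(float));
    cudaMalloc(&done, sizeof(int));
    cudaMalloc(&iters, sizeof(int));
    cudaMemcpy(value, &start, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(done, 0, sizeof(int));
    cudaMemset(iters, 0, sizeof(int));

    for (int i = 0; i < maxIters; ++i) {
        // The host cannot see "done" without a sync, so it keeps launching.
        kernelA<<<1, 1>>>(value, done);
        kernelB<<<1, 1>>>(value, done);
        kernelC<<<1, 1>>>(value, done, iters, threshold);
    }

    int usefulIters = 0;
    cudaMemcpy(&usefulIters, iters, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(value);
    cudaFree(done);
    cudaFree(iters);
    return usefulIters;
}
```

Calling `runHostLoop(8.0f, 1.5f, 100)` launches 300 kernels, of which only the first three iterations do any work; the rest hit the early-out.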

Once the above was working, I changed the code so that the loop is issued from a kernel, using dynamic parallelism to issue the actual iteration sequences as child launches of the looping kernel. The looping kernel then has access to the intermediate results and can determine whether additional looping is necessary (and exit if not). This also works.
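One way the device-driven loop could look is sketched below. This is an assumption-heavy illustration, not the code from this post: it uses the tail-launch stream (`cudaStreamTailLaunch`) from the CUDA 12 device runtime, so the "loop" is expressed as a kernel that relaunches itself after each A->B->C sequence rather than a literal `for` loop. `LoopState`, `loopKernel` and the single-thread kernels are hypothetical names, and device-runtime limits on pending launches are not considered.

```cuda
#include <cuda_runtime.h>

// Hypothetical shared state for the refinement loop.
struct LoopState {
    float value;  // quantity being refined
    int done;     // set once the threshold is met
    int iters;    // number of A->B->C sequences executed
};

// Placeholder kernels standing in for A, B and C.
__global__ void kernelA(LoopState* s) { s->value *= 0.5f; }
__global__ void kernelB(LoopState* s) { s->value = fmaxf(s->value, 0.0f); }
__global__ void kernelC(LoopState* s, float threshold) {
    s->iters += 1;
    if (s->value < threshold) s->done = 1;
}

// Looping kernel: checks the intermediate result and, if more work is needed,
// queues one A->B->C sequence followed by itself. Grids in the tail-launch
// stream start only after this grid finishes, and run in the order queued.
__global__ void loopKernel(LoopState* s, float threshold, int remaining) {
    if (s->done || remaining == 0) return;  // converged or out of budget
    kernelA<<<1, 1, 0, cudaStreamTailLaunch>>>(s);
    kernelB<<<1, 1, 0, cudaStreamTailLaunch>>>(s);
    kernelC<<<1, 1, 0, cudaStreamTailLaunch>>>(s, threshold);
    loopKernel<<<1, 1, 0, cudaStreamTailLaunch>>>(s, threshold, remaining - 1);
}

// Host side: a single launch; the device decides how many iterations run.
int runDeviceLoop(float start, float threshold, int maxIters) {
    LoopState h{start, 0, 0};
    LoopState* d = nullptr;
    cudaMalloc(&d, sizeof(LoopState));
    cudaMemcpy(d, &h, sizeof(LoopState), cudaMemcpyHostToDevice);

    loopKernel<<<1, 1>>>(d, threshold, maxIters);
    cudaDeviceSynchronize();  // waits for the parent and all descendant grids

    cudaMemcpy(&h, d, sizeof(LoopState), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h.iters;
}
```

Device-side launches need relocatable device code, e.g. `nvcc -rdc=true -lcudadevrt` with a suitable `-arch`. Keeping an `iters` counter in device memory, as here, is also a cheap way to recover the iteration count on the host even when the profiler does not show the child launches.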

The problem is that the profiler (nsight in my case) only tracks timing information for the kernel executing the loop, and I am no longer able to track the timing of the individual sequence kernels being called FROM the loop kernel.

So instead of having detailed information on all the AN, BN and CN kernels, all I see is a single launch of the kernel that executes the loop and uses DP to launch the A->B->C sequences. I no longer have the times for A, B and C. More importantly, I don’t know how many times the loop has actually executed.

Ideally I could annotate each of the A, B and C kernels to identify their start/end times (from the DP loop’s perspective) and have that reported to nsight so I could see how long they executed and how many times the loop iterated.

NVTX can only be used to annotate host code.
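For reference, a minimal sketch of what host-side annotation looks like, assuming the header-only NVTX v3 that ships with recent CUDA toolkits (older setups would include `nvToolsExt.h` and link `-lnvToolsExt` instead). `refineStep` and `runAnnotated` are hypothetical names. Note that the ranges mark when the host issues the work, not when the kernel runs on the GPU, because launches are asynchronous.

```cuda
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>  // assumed: header-only NVTX v3 from the toolkit

// Hypothetical placeholder kernel.
__global__ void refineStep(float* value) { *value *= 0.5f; }

// Wraps host-side launches in NVTX ranges so they show up as named spans
// on the profiler timeline. Returns the refined value.
float runAnnotated(float start, int iterations) {
    float* d = nullptr;
    cudaMalloc(&d, sizeof(float));
    cudaMemcpy(d, &start, sizeof(float), cudaMemcpyHostToDevice);

    nvtxRangePushA("refinement loop");   // outer range, host thread only
    for (int i = 0; i < iterations; ++i) {
        nvtxRangePushA("iteration");     // covers the launch call, not GPU time
        refineStep<<<1, 1>>>(d);
        nvtxRangePop();
    }
    cudaDeviceSynchronize();             // outer range now spans GPU completion
    nvtxRangePop();

    float result = 0.0f;
    cudaMemcpy(&result, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return result;
}
```

On Linux the header-only NVTX typically needs `-ldl` at link time. Nothing here can be called from inside `refineStep` or any other kernel, which is the limitation described above.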

I’m pretty sure nvvp can track child kernel launches. When you say “nsight”, it’s not clear whether you mean Nsight VSE (i.e. Windows) or Nsight EE (i.e. Linux).

This is what I see when I run nvvp (NVIDIA Visual Profiler) on the cdpSimpleQuickSort sample code:


Ugh. There’s an option in Visual Studio to enable “Dynamic Parallelism Kernel Execution Trace”, which had gotten turned off.

Thanks txbob, for pointing me in the right direction.