Hey, I have searched for a solution to this but can’t find a definitive answer.
It seems as if NVTX can annotate CUDA behaviour and provide the necessary runtime statistics, but only for API calls. Am I missing something? Can kernels be annotated in this manner? If so, does anybody have a resource with examples of how to do this?
What I am trying to do and why:
I have a looping sequence of kernels that refines an answer until a “desired threshold” has been met. The ideal number of times the sequence of kernels gets called depends on the data.
Originally, I had the CPU simply loop over the sequence, and the kernels themselves would early-out if there was no work left to do. The problem was that the CPU would issue, say, 100 loop iterations to the GPU, and if the data converged after 3 iterations, the stream still had to work through the remaining 97. The kernels would early-out, but they were still issued and executed. E.g., for kernel sequence [A->B->C], the CPU would issue:
[A0->B0->C0]->[A1->B1->C1]-> … -> [A99->B99->C99]
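For concreteness, the host-driven version looked roughly like the sketch below. The names `stepA`, `stepB`, `stepC` and `runHostLoop`, the single-float "residual", and the `done` flag are placeholders made up for this post, not my real code. The point is only that every iteration is still launched after `done` has been set.

```cuda
#include <cuda_runtime.h>

// Placeholder single-thread kernels standing in for A, B and C (hypothetical).
// Each one early-outs as soon as the shared `done` flag is set.
__global__ void stepA(float* residual, const int* done) {
    if (*done) return;
    *residual *= 0.5f;                     // pretend refinement: halve the residual
}
__global__ void stepB(float* residual, const int* done) {
    if (*done) return;
    *residual = fabsf(*residual);          // pretend clean-up pass
}
__global__ void stepC(const float* residual, int* done, float threshold) {
    if (*done) return;
    if (*residual < threshold) *done = 1;  // threshold met: later launches become no-ops
}

// Issues all maxIter iterations up front and returns the final residual.
float runHostLoop(int maxIter, float threshold) {
    float* residual = nullptr;
    int* done = nullptr;
    cudaMalloc(&residual, sizeof(float));
    cudaMalloc(&done, sizeof(int));

    const float start = 1.0f;
    cudaMemcpy(residual, &start, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(done, 0, sizeof(int));

    // The host cannot see `done`, so it always queues the full sequence.
    for (int i = 0; i < maxIter; ++i) {
        stepA<<<1, 1>>>(residual, done);
        stepB<<<1, 1>>>(residual, done);
        stepC<<<1, 1>>>(residual, done, threshold);
    }

    float result = 0.0f;
    // Blocking copy on the default stream: waits for all queued kernels.
    cudaMemcpy(&result, residual, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(residual);
    cudaFree(done);
    return result;
}
```

With `runHostLoop(100, 0.2f)` the toy residual drops below the threshold on the third iteration, yet all 300 kernels are still launched and show up individually in the profiler.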
Once the above was working, I changed the code so that the loop runs inside a kernel, which uses dynamic parallelism to issue the sequence for each iteration as child launches. The looping kernel then has access to the intermediate results and can decide whether another iteration is necessary or whether it should exit. This also works.
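The dynamic parallelism version is structured roughly like the sketch below. Again, `stepA`, `stepB`, `stepC`, `loopKernel` and `runDeviceLoop` are placeholder names, and the toy residual stands in for my real data. It assumes the legacy dynamic parallelism model, where a parent kernel may call `cudaDeviceSynchronize()` to wait for its children; as far as I know that call is deprecated in recent toolkits, so treat this as a sketch of the structure rather than something that builds everywhere. It also needs relocatable device code (`-rdc=true`) and linking against `cudadevrt`.

```cuda
#include <cuda_runtime.h>

// Placeholder child kernels (hypothetical); no early-out needed any more.
__global__ void stepA(float* residual) { *residual *= 0.5f; }
__global__ void stepB(float* residual) { *residual = fabsf(*residual); }
__global__ void stepC(const float* residual, int* done, float threshold) {
    if (*residual < threshold) *done = 1;
}

// Launched once from the host as <<<1, 1>>>; issues A->B->C as child launches.
__global__ void loopKernel(float* residual, int* done, float threshold, int maxIter) {
    volatile int* doneFlag = done;  // force a fresh read after each sync
    for (int i = 0; i < maxIter && !*doneFlag; ++i) {
        // Children launched by one thread into the default stream run in order.
        stepA<<<1, 1>>>(residual);
        stepB<<<1, 1>>>(residual);
        stepC<<<1, 1>>>(residual, done, threshold);
        cudaDeviceSynchronize();    // legacy device-side wait for the children
    }
}

// Host side: a single launch; returns the final residual.
float runDeviceLoop(int maxIter, float threshold) {
    float* residual = nullptr;
    int* done = nullptr;
    cudaMalloc(&residual, sizeof(float));
    cudaMalloc(&done, sizeof(int));

    const float start = 1.0f;
    cudaMemcpy(residual, &start, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(done, 0, sizeof(int));

    loopKernel<<<1, 1>>>(residual, done, threshold, maxIter);

    float result = 0.0f;
    cudaMemcpy(&result, residual, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(residual);
    cudaFree(done);
    return result;
}
```

From the host's point of view there is now exactly one kernel launch, which is why the profiler timeline collapses to a single entry.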
The problem is that the profiler (Nsight in my case) only tracks timing information for the kernel executing the loop, so I am no longer able to track the timing of the individual sequence kernels being called FROM the loop kernel.
So instead of having detailed information on all the AN, BN and CN kernels, all I see is a single launch of the loop kernel, which uses DP to launch the A->B->C sequences. I no longer have the times for A, B and C. More importantly, I don’t know how many times the loop actually executed.
Ideally, I could annotate each of the A, B and C kernels to mark their start/end times (from the DP loop’s perspective) and have that reported to Nsight, so I could see how long they executed and how many times the loop iterated.
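The only fallback I can think of is to have the loop kernel do its own bookkeeping: read the `%globaltimer` PTX register around each child launch and write the deltas plus an iteration count to device memory for the host to read back. Below is a minimal sketch of that idea, assuming the legacy device-side `cudaDeviceSynchronize()` is available; all names (`IterTiming`, `globalTimerNs`, `timedLoopKernel`, `runTimedLoop`, the `step*` kernels) are hypothetical. The extra syncs between A, B and C perturb the very timings being measured, the timer resolution is hardware dependent, and none of this shows up in Nsight, which is why I would still prefer a real annotation mechanism.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct IterTiming { unsigned long long a, b, c; };  // nanoseconds per child kernel

// Placeholder child kernels (hypothetical).
__global__ void stepA(float* residual) { *residual *= 0.5f; }
__global__ void stepB(float* residual) { *residual = fabsf(*residual); }
__global__ void stepC(const float* residual, int* done, float threshold) {
    if (*residual < threshold) *done = 1;
}

// Reads the GPU-wide nanosecond timer through inline PTX.
__device__ unsigned long long globalTimerNs() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Launched as <<<1, 1>>>; logs one IterTiming per iteration and the loop count.
__global__ void timedLoopKernel(float* residual, int* done, IterTiming* timings,
                                int* iterations, float threshold, int maxIter) {
    volatile int* doneFlag = done;
    int i = 0;
    while (i < maxIter && !*doneFlag) {
        unsigned long long t0 = globalTimerNs();
        stepA<<<1, 1>>>(residual);
        cudaDeviceSynchronize();              // legacy device-side wait
        unsigned long long t1 = globalTimerNs();
        stepB<<<1, 1>>>(residual);
        cudaDeviceSynchronize();
        unsigned long long t2 = globalTimerNs();
        stepC<<<1, 1>>>(residual, done, threshold);
        cudaDeviceSynchronize();
        unsigned long long t3 = globalTimerNs();

        timings[i].a = t1 - t0;               // includes launch and sync overhead
        timings[i].b = t2 - t1;
        timings[i].c = t3 - t2;
        ++i;
    }
    *iterations = i;
}

// Host side: runs the loop, prints the per-iteration log, returns the iteration count.
int runTimedLoop(int maxIter, float threshold) {
    float* residual = nullptr;
    int* done = nullptr;
    int* iterations = nullptr;
    IterTiming* timings = nullptr;
    cudaMalloc(&residual, sizeof(float));
    cudaMalloc(&done, sizeof(int));
    cudaMalloc(&iterations, sizeof(int));
    cudaMalloc(&timings, maxIter * sizeof(IterTiming));

    const float start = 1.0f;
    cudaMemcpy(residual, &start, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(done, 0, sizeof(int));
    cudaMemset(iterations, 0, sizeof(int));

    timedLoopKernel<<<1, 1>>>(residual, done, timings, iterations, threshold, maxIter);

    int count = 0;
    cudaMemcpy(&count, iterations, sizeof(int), cudaMemcpyDeviceToHost);
    IterTiming* host = new IterTiming[maxIter];
    cudaMemcpy(host, timings, maxIter * sizeof(IterTiming), cudaMemcpyDeviceToHost);
    for (int i = 0; i < count; ++i)
        std::printf("iter %d: A=%llu ns  B=%llu ns  C=%llu ns\n",
                    i, host[i].a, host[i].b, host[i].c);
    delete[] host;

    cudaFree(residual); cudaFree(done); cudaFree(iterations); cudaFree(timings);
    return count;
}
```

With the toy data, `runTimedLoop(100, 0.2f)` reports 3 iterations, which at least answers the "how many times did the loop run" part.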