Is there any way to collect only the total duration of the CUDA kernels within each NVTX range?

Hello! I would like to profile the forward, backward, and optimizer phases of my model. I have added some NVTX annotations to my code so that I can observe the forward, backward, and optimizer in the Nsight Systems timeline. Although I can obtain the time consumed by the forward, backward, and optimizer on the CPU and on each CUDA stream, I am unable to exclude the GPU's idle time. Is there any way to collect only the total duration of the CUDA kernels within each NVTX range, so that I can determine the precise computation time of the forward, backward, and optimizer?
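For reference, the annotations look roughly like this (a minimal PyTorch sketch with a toy linear model, not my actual training code), using `torch.cuda.nvtx` to mark the three phases:

```python
import torch
import torch.cuda.nvtx as nvtx

# Minimal sketch (not the actual model): a tiny linear model with NVTX ranges
# marking the forward, backward, and optimizer phases so they appear as named
# ranges on the Nsight Systems timeline.
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randn(8, 1)

for _ in range(3):
    nvtx.range_push("forward")
    loss = torch.nn.functional.mse_loss(model(x), y)
    nvtx.range_pop()

    nvtx.range_push("backward")
    loss.backward()
    nvtx.range_pop()

    nvtx.range_push("optimizer")
    opt.step()
    opt.zero_grad()
    nvtx.range_pop()
```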

I think your best bet would be to export the data to SQLite and write a script that sums only the time taken by CUDA kernels inside those NVTX ranges. You can probably start from our provided cuda_api_gpu_sum script and modify it as needed. See User Guide :: Nsight Systems Documentation for details (the forum shows only the top-level page name, but the link points to the correct section).
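To make the idea concrete, here is a minimal sketch. It assumes the report has been exported with `nsys export --type sqlite report.nsys-rep`, and that the export contains the usual `NVTX_EVENTS`, `CUPTI_ACTIVITY_KIND_RUNTIME`, and `CUPTI_ACTIVITY_KIND_KERNEL` tables with `start`/`end` timestamps in nanoseconds (table and column names can vary between Nsight Systems versions, so verify them against your own `.sqlite` file). A kernel is attributed to an NVTX range when its launch API call began inside that range, and only kernel execution time is summed, so GPU idle gaps are excluded. The in-memory database at the bottom is a stand-in for a real export, for demonstration only:

```python
import sqlite3

# Sketch only: the table/column names below are assumptions based on a typical
# `nsys export --type sqlite` report -- verify them against your own file.
# A kernel is attributed to an NVTX range when its launch API call started
# inside that range; summing kernel durations excludes GPU idle time.
QUERY = """
SELECT n.text               AS range_name,
       SUM(k.end - k.start) AS gpu_ns
FROM NVTX_EVENTS n
JOIN CUPTI_ACTIVITY_KIND_RUNTIME r
  ON r.start BETWEEN n.start AND n.end      -- launch call issued inside range
JOIN CUPTI_ACTIVITY_KIND_KERNEL k
  ON k.correlationId = r.correlationId      -- launch call -> GPU kernel
GROUP BY n.text
"""

def kernel_time_per_nvtx_range(conn):
    """Map each NVTX range name to its summed kernel duration (ns)."""
    return dict(conn.execute(QUERY).fetchall())

# Tiny in-memory stand-in for an exported report, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE NVTX_EVENTS (start INT, end INT, text TEXT);
    CREATE TABLE CUPTI_ACTIVITY_KIND_RUNTIME (start INT, end INT, correlationId INT);
    CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (start INT, end INT, correlationId INT);
    INSERT INTO NVTX_EVENTS VALUES (0, 100, 'forward');
    INSERT INTO CUPTI_ACTIVITY_KIND_RUNTIME VALUES (10, 12, 1), (20, 22, 2);
    -- Kernels run later on the GPU, with an idle gap (60..80) between them.
    INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (50, 60, 1), (80, 95, 2);
""")
print(kernel_time_per_nvtx_range(conn))  # {'forward': 25}
```

Against a real export you would open the `.sqlite` file instead of the demo database. Note that on a multi-threaded trainer you would also want to match the launching thread (e.g. a `globalTid` column) between the runtime call and the NVTX range, since ranges are per-thread.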

Also, just as a general note: if the gaps on the GPU side are large, you should probably try to remove those gaps before analyzing the individual phases. You might want to take a look at the memory transfers, for example. See https://developer.nvidia.com/blog/optimizing-cuda-memory-transfers-with-nsight-systems/ for some examples.
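The same SQLite export can be queried for transfer time as a first check on those gaps. The sketch below sums memcpy duration and bytes per copy kind from a `CUPTI_ACTIVITY_KIND_MEMCPY` table (the table/column names and the CUPTI `copyKind` codes, 1 = HtoD, 2 = DtoH, 8 = DtoD, are assumptions based on a typical export; verify against your own file). The in-memory database is again only a stand-in for a real report:

```python
import sqlite3

# Sketch only: table/column names and the copyKind codes (1 = HtoD, 2 = DtoH,
# 8 = DtoD in CUPTI) are assumptions based on a typical nsys SQLite export --
# verify them against your own file.
KIND = {1: "HtoD", 2: "DtoH", 8: "DtoD"}

def memcpy_time_by_kind(conn):
    """Map each copy kind to (total ns, total bytes) over all memcpys."""
    rows = conn.execute("""
        SELECT copyKind, SUM(end - start), SUM(bytes)
        FROM CUPTI_ACTIVITY_KIND_MEMCPY
        GROUP BY copyKind
    """)
    return {KIND.get(k, str(k)): (ns, nbytes) for k, ns, nbytes in rows}

# In-memory stand-in for a real export, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUPTI_ACTIVITY_KIND_MEMCPY
        (start INT, end INT, bytes INT, copyKind INT);
    INSERT INTO CUPTI_ACTIVITY_KIND_MEMCPY VALUES
        (0, 40, 1024, 1),   -- host-to-device
        (50, 70, 1024, 1),  -- host-to-device
        (90, 95, 256, 2);   -- device-to-host
""")
print(memcpy_time_by_kind(conn))
```

If host-to-device time dominates the gaps, the pinned-memory and overlap techniques from the blog post above are the usual next step.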