The issue is that there is a noticeable time gap between the completion of one kernel's execution and the launch of the next kernel. From the profile alone, I can't determine what is happening during this time.
How can I analyze what is consuming the time during this period? Note: the computation shown in the diagram is a simple encoder layer.
I found that part of the gap is caused by host time spent in torch.cuda.nvtx, but for the remaining part I still cannot identify the specific cause of the time consumption. Is there any method to profile the host-side time?
You may get better help with PyTorch questions by asking on a PyTorch forum, such as discuss.pytorch.org. There are NVIDIA experts who patrol that forum. For profiler questions, you can ask on one of the profiler forums.
From a development perspective, I would find out (via source-code inspection) what is happening around, or prior to, the kernel launch(es) in question, then start using NVTX myself to mark ranges of activity and see what shows up when I re-profile the code. You can use a hierarchical/binary-search approach to divide and conquer, and zero in on a particular set of activity fairly quickly.
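As a rough illustration of that approach, here is a minimal sketch of wrapping suspected host-side regions with NVTX ranges via `torch.cuda.nvtx.range_push`/`range_pop` and profiling only the marked iteration. The `nn.TransformerEncoderLayer` stand-in, the `forward_with_markers` helper, and the `nsys` invocation in the comments are assumptions for the example, not the original poster's code; narrow or split the ranges on each re-profile to zero in on the gap.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the encoder layer in question.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
x = torch.randn(32, 16, 512, device="cuda")

def forward_with_markers(inp):
    # Mark suspected host-side work so it shows up as an NVTX range in the timeline.
    torch.cuda.nvtx.range_push("pre_launch_host_work")
    # ... any Python/host bookkeeping that runs before the kernel launches ...
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("encoder_forward")
    out = layer(inp)
    torch.cuda.nvtx.range_pop()
    return out

# Warm up first so one-time initialization doesn't pollute the capture.
for _ in range(3):
    forward_with_markers(x)
torch.cuda.synchronize()

# Capture just one marked iteration, e.g. with:
#   nsys profile -c cudaProfilerApi python script.py
torch.cuda.profiler.start()   # cudaProfilerStart()
forward_with_markers(x)
torch.cuda.synchronize()
torch.cuda.profiler.stop()    # cudaProfilerStop()
```

If a marked range accounts for the gap, split it into smaller sub-ranges and repeat; if none do, the time is being spent outside your marked code (for example in the framework's launch path), which tells you where to inspect next.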