I am profiling a PyTorch inference application. When I profile it with NVIDIA Nsight Systems (nsys), here is the snapshot as seen in the Nsight Systems UI.
That is what that should mean. I can’t tell from the screenshot, but if I were looking at this, I would check the CUDA API calls on the CPU side to see whether a forced synchronization is causing it. It could also be that the CPU is not supplying enough work. You could check the OS runtime (OSRT) data and the CPU backtraces to see what was going on in this time frame.
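As a rough illustration of the first check, here is a minimal Python sketch that scans a CUDA API trace for calls that block the CPU. It assumes the trace has already been exported into a list of (name, start, end) records; that record layout, the sample values, and the helper name `find_blocking_calls` are all hypothetical, and the set of blocking calls is illustrative rather than exhaustive.

```python
# CUDA runtime calls that make the CPU wait for the GPU.
# Illustrative assumption only; this is not an exhaustive list.
BLOCKING_CALLS = {
    "cudaDeviceSynchronize",
    "cudaStreamSynchronize",
    "cudaEventSynchronize",
    "cudaMemcpy",
}


def find_blocking_calls(api_calls):
    """Return the (name, start_us, end_us) records that are known blocking calls."""
    hits = []
    for name, start_us, end_us in api_calls:
        # Some exports append a version suffix such as "_v3020"; strip it if present.
        base = name.split("_v")[0]
        if base in BLOCKING_CALLS:
            hits.append((name, start_us, end_us))
    return hits


# Made-up sample records (timestamps in microseconds), not real profiler output.
sample_calls = [
    ("cudaLaunchKernel", 0.0, 5.0),
    ("cudaStreamSynchronize", 5.0, 120.0),
    ("cudaLaunchKernel", 220.0, 226.0),
    ("cudaMemcpyAsync", 230.0, 234.0),
]

blocking = find_blocking_calls(sample_calls)
for name, start_us, end_us in blocking:
    print(f"{name}: blocked the CPU for {end_us - start_us:.1f} us")
```

In a PyTorch application, calls such as `.item()` or `.cpu()` on a GPU tensor are common sources of this kind of synchronization, so long blocking calls in the trace are worth matching against those spots in the code.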
Apologies for the delay. Thanks for the insight. Interestingly enough, there are “gaps” between consecutive CUDA API calls in the CUDA API trace as well. The average “gap length” is 100 microseconds over the subset of the trace I looked at. Do you have any suggestions as to the possible causes of this delay? I suspect the CPU is not sending work fast enough.
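For reference, the gap measurement described here can be sketched in Python as follows. This assumes the CUDA API trace is available as a list of (name, start, end) records in microseconds; the record layout, the sample values, and the helper name `api_gaps` are hypothetical and only meant to show the arithmetic.

```python
from statistics import mean


def api_gaps(api_calls):
    """Return the idle time (us) between the end of each call and the start of the next."""
    # Sort by start time so that consecutive records are adjacent on the timeline.
    ordered = sorted(api_calls, key=lambda call: call[1])
    gaps = []
    for (_, _, prev_end), (_, next_start, _) in zip(ordered, ordered[1:]):
        # Clamp at zero in case calls from different threads overlap.
        gaps.append(max(0.0, next_start - prev_end))
    return gaps


# Made-up sample records (timestamps in microseconds), not real profiler output.
sample_calls = [
    ("cudaLaunchKernel", 0.0, 5.0),
    ("cudaLaunchKernel", 105.0, 110.0),
    ("cudaLaunchKernel", 210.0, 216.0),
]

gaps = api_gaps(sample_calls)
print(f"average gap: {mean(gaps):.1f} us over {len(gaps)} gaps")
```

If the gaps stay roughly constant regardless of kernel duration, that would be consistent with per-call CPU overhead between launches rather than the GPU being the bottleneck, though the trace alone does not prove it.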
Thanks
Yeah, it is difficult to say without the full context, which I did not give in the question. Apologies for that. But thanks a lot for the insights.