What are possible reasons for high kernel launch latency?

GPU kernel launch latency (the time from when the CPU code reaches the kernel launch in your source code until the kernel actually begins executing on the GPU) can be affected if:

  • the GPU is busy with other work &
  • the launch queue is full
  • the CPU is busy or heavily loaded
  • a synchronizing operation is needed, for example because of lazy loading &
  • threads in a multi-threaded application compete for internal resource locks
  • there is a varying or large parameter pack (data size of the arguments passed to the kernel) &
  • the GPU is in default compute mode and there are other users of the GPU (including other containers) &

If we focus only on the latency between when the launch was actually requested (roughly, the end of the bar in the API section of the profiler timeline) and when the kernel actually began executing (the start of the bar in the device section), then the list is shorter. I have marked those items above with &. There are probably other causes that I don’t know of or haven’t remembered.
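If you want a rough number outside the profiler, here is a minimal sketch that times the two intervals from the host side. It is only an approximation under some assumptions of mine: `empty_kernel` and the iteration count are made up for illustration, the first interval (the time spent in the launch call) roughly corresponds to the bar in the API section, and the second interval runs until `cudaDeviceSynchronize()` returns, so it also includes the (tiny) kernel execution and the synchronization cost. Treat it as an upper bound on launch latency, not as what the profiler would report.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so what we time is mostly overhead.
__global__ void empty_kernel() {}

int main() {
    using clk = std::chrono::steady_clock;
    using us  = std::chrono::duration<double, std::micro>;

    // Warm-up launch: absorbs one-time costs such as context creation
    // and lazy module loading, so they do not distort the averages.
    empty_kernel<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int iters = 1000;  // arbitrary sample count (assumption)
    double api_us = 0.0;     // time spent inside the launch call on the CPU
    double total_us = 0.0;   // launch call + kernel + synchronization

    for (int i = 0; i < iters; ++i) {
        auto t0 = clk::now();
        empty_kernel<<<1, 1>>>();   // asynchronous: returns once the launch is queued
        auto t1 = clk::now();
        cudaDeviceSynchronize();    // wait until the kernel has finished
        auto t2 = clk::now();

        api_us   += us(t1 - t0).count();
        total_us += us(t2 - t0).count();
    }

    err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("avg time in launch call:       %.2f us\n", api_us / iters);
    std::printf("avg launch-to-completion time: %.2f us\n", total_us / iters);
    return 0;
}
```

This should build with something like `nvcc -o launch_latency launch_latency.cu` (the file name is arbitrary). Because the loop synchronizes after every launch, the GPU is idle and the launch queue is empty each time, so it gives a baseline. Running it while the GPU or CPU is loaded, or from several threads at once, is one way to see how the items in the list above move the numbers.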

Without the code and interactive access to the profiler, I don’t think I can offer any further advice.
