"idle time" between kernel calls ( from NVVP inspection)

Hi,

I have an application that runs 15-20 short-lived kernels in a row (each with less than 1 ms of running time). Looking at the visual profiler output, the serial timeline looks something like this:

H2D transfer
[idle time: ~10 us]
kernel_1
[idle time: ~10 us]
kernel_2
[idle time: ~10 us]
...
kernel_X
[idle time: ~2500 us]
kernel_Y
[idle time: ~10 us]
...
kernel_20

I’m trying to understand why, between two particular kernels, there always appears to be a 2-2.5 ms “idle time” during which nothing is happening (no memory transfers and no kernel executing).

I can live with the occasional 0.5 ms idle time, but not a consistent 2.5 ms that always happens at the same spot.

Executing the whole set of kernels takes less than 20 ms, so it shouldn’t be due to WDDM batching, right?

Thanks for your input!

In my answer to cuda - Difference in time reported by NVVP and counters - Stack Overflow, I detail how time is calculated in the current profilers.

An empty space of < 10 µs between launches is likely due to kernel launch configuration setup; the GPU is likely not idle during this time. As noted in the Stack Overflow post, the CUDA profilers do not include launch configuration overhead in the reported duration of the kernel.
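For illustration, here is a minimal sketch (the kernel and sizes are hypothetical) of bracketing a launch with CUDA events in the same stream. The event-measured interval spans everything the GPU does between the two events, so it will typically include the setup time that the profiler shows as a small gap rather than as part of the kernel's duration:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical short-lived kernel standing in for kernel_1 .. kernel_20.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 16;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same (default) stream as the launch, so the
    // measured interval covers launch configuration setup plus kernel execution.
    cudaEventRecord(start, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("event-measured launch + execution: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}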

WDDM drivers have additional launch overhead on the CPU side due to queuing in the CUDA driver and various latencies in the WDDM kernel mode driver. An empty space on the timeline of several milliseconds usually indicates stalling due to queuing in the CUDA driver (only on WDDM) or the presence of another context (probably graphics) executing on the GPU.
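If the stall turns out to be WDDM command batching in the CUDA driver, one commonly used mitigation is to nudge the driver into flushing its software queue after a batch of launches, for example with cudaStreamQuery(). A minimal sketch under that assumption (the kernel and buffer names are hypothetical):

#include <cuda_runtime.h>

// Hypothetical stand-in for the application's short-lived kernels.
__global__ void shortKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void launchBatch(float *d_buf, int n, cudaStream_t stream)
{
    const int block = 256;
    const int grid  = (n + block - 1) / block;

    for (int k = 0; k < 20; ++k)
        shortKernel<<<grid, block, 0, stream>>>(d_buf, n);

    // On WDDM, launches may sit in the CUDA driver's software queue until the
    // driver decides to submit the command buffer. Querying the stream is a
    // lightweight, non-blocking way to push the queued work out to the GPU.
    cudaStreamQuery(stream);
}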

Nsight Visual Studio Edition 3.0 RC has two new features in the Analysis Trace Activity to help identify the cause of the latency:

  1. Windows Display Driver Model Trace

This captures events in the WDDM driver. These events include

  • queuing and execution of command buffers
  • memory copy to/from device memory
  • paging of memory from host to device memory

This information is captured for all processes in the system. A command buffer can contain multiple CUDA commands including kernel launches, memory copies, and memory sets.

This information is displayed in the Timeline report page under the node System\GPU Usage and System\GPU Stats.

In the Analysis Activity under Trace Settings

  • Enable System
  • Expand System and enable Windows Display Driver Model Trace\WDDM Base Events

  2. CUDA Driver Queue Latency

This feature tracks the depth of the CUDA driver queue (WDDM only) and the kernel/hardware queue.

VIEWING THE DATA
This information is displayed as a graph in the Timeline report page under the CUDA\Context #\Counters node. The positive y-axis of the graph is the depth of the CUDA driver queue. The negative y-axis of the graph is the depth of the kernel/hardware queue.

The submit time, queued time, and latency are also available in the CUDA Launches report page.

COLLECTING THE DATA
In the Analysis Activity under Trace Settings

  • Enable CUDA
  • Expand CUDA and enable Kernel Launches and Memory Options, and Driver Queue Latency.

This feature adds ~2 µs each time work is submitted from the software queue to the kernel/hardware queue, so it is advisable to enable it only when you are debugging WDDM queuing issues.

Greg, thank you very much for your detailed answer.

I am doing a graphics update on a separate thread which reads an output buffer produced by the “worker thread” detailed above.

Lowering the number of updates that the display thread performs seemed to mitigate the problem a bit; however, I’m still running into 1.5-3 ms stalls, both in the temporal vicinity of a graphics update and when there should be nothing else going on.

You can see my images below to get an idea:


I’ve performed an analysis using Nsight 3.0 RC.

  • Every 10 ms there is WDDM update activity.
  • This coincides with my NVVP results, where the temporal distance between each 1.5-3 ms gap is 10 ms.
  • My OpenGL graphics updates happen every 30 ms, and if I remove the visualization I still see the above-mentioned gap in the same pattern.

Could someone with more knowledge advise me?

Thanks so much,

OK, I discovered that there was a tiny 4-byte memcpy H2D at the end of the 1.5-3 ms gap. The memcpy was so small that I did not manage to spot it in NVVP among all the other kernels.

The memcpy was synchronous and hence forced the whole kernel queue to wait.
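For anyone hitting the same issue, a hedged sketch of one possible fix (the variable names are hypothetical, not from my actual code): keep the 4-byte host value in pinned memory and enqueue the copy with cudaMemcpyAsync in the same stream as the kernels, so it no longer blocks the host thread between launches.

#include <cuda_runtime.h>

// Hypothetical 4-byte parameter that was being copied H2D between kernels.
int *h_param = nullptr;   // pinned host copy
int *d_param = nullptr;   // device copy

void setupParam()
{
    // Pinned (page-locked) host memory is required for the copy to be
    // truly asynchronous with respect to the host thread.
    cudaHostAlloc(&h_param, sizeof(int), cudaHostAllocDefault);
    cudaMalloc(&d_param, sizeof(int));
}

void updateParam(int value, cudaStream_t stream)
{
    *h_param = value;
    // Enqueued in the same stream as the kernels, so the copy is ordered with
    // the launches but does not stall the host the way the synchronous copy did.
    cudaMemcpyAsync(d_param, h_param, sizeof(int), cudaMemcpyHostToDevice, stream);
}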

EDIT: Btw when in doubt as to what the problem is, always question yourself ;)