In the answer http://stackoverflow.com/questions/12497619/difference-in-time-reported-by-nvvp-and-counters/12502143#12502143 I detail how time is calculated in the current profilers.
An empty space of < 10µs between launches is likely due to kernel launch configuration setup; the GPU is probably not idle during this time. As listed in the Stack Overflow answer, the CUDA profilers do not include launch configuration overhead in the duration of the kernel.
WDDM drivers have additional launch overhead on the CPU side due to queuing in the CUDA driver and various latencies in the WDDM kernel mode driver. An empty space on the timeline of several milliseconds usually indicates stalling due to queuing in the CUDA driver (only on WDDM) or the presence of another context (probably graphics) executing on the GPU.
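As a rough illustration of the two cases above, the sketch below walks a list of kernel start/end times and labels each gap. The tuple format and the `classify_gaps` name are hypothetical, not a profiler export format. The 10µs threshold comes from the figure above; the 1000µs cutoff is only an assumed stand-in for "several milliseconds".

```python
# Thresholds in microseconds.
SETUP_GAP_US = 10.0     # from the "< 10µs" figure above
STALL_GAP_US = 1000.0   # assumption: stand-in for "several milliseconds"


def classify_gaps(kernels):
    """Label the gap between consecutive kernels.

    kernels: list of (name, start_us, end_us) tuples (hypothetical format).
    Assumes the kernels do not overlap on the timeline.
    """
    ordered = sorted(kernels, key=lambda k: k[1])  # sort by start time
    report = []
    for (prev_name, _, prev_end), (next_name, next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end
        if gap < SETUP_GAP_US:
            label = "launch setup (GPU likely not idle)"
        elif gap >= STALL_GAP_US:
            label = "possible driver queuing or another context"
        else:
            label = "unclassified"
        report.append((prev_name, next_name, gap, label))
    return report


# Made-up timeline values, for illustration only.
timeline = [
    ("kernel_a", 0.0, 120.0),
    ("kernel_b", 126.0, 300.0),
    ("kernel_c", 3800.0, 4000.0),
]
for row in classify_gaps(timeline):
    print(row)
```

A label from this sketch is only a hint about where to look; the trace features described next are what actually identify the cause.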
Nsight Visual Studio Edition 3.0 RC has two new features in the Analysis Trace Activity to help identify the cause of the latency:
- Windows Display Driver Model Trace
This captures events in the WDDM driver. These events include
- queuing and execution of command buffers
- memory copy to/from device memory
- paging of memory from host to device memory
This information is captured for all processes in the system. A command buffer can contain multiple CUDA commands including kernel launches, memory copies, and memory sets.
This information is displayed in the Timeline report page under the nodes System\GPU Usage and System\GPU Stats.
In the Analysis Activity under Trace Settings
- Enable System
- Expand System and enable Windows Display Driver Model Trace\WDDM Base Events
- CUDA Driver Queue Latency
This feature tracks the depth of the CUDA driver queue (WDDM only) and the kernel/hardware queue.
VIEWING THE DATA
This information is displayed as a graph in the Timeline report page under the CUDA\Context #\Counters node. The positive y-axis of the graph is the depth of the CUDA driver queue. The negative y-axis of the graph is the depth of the kernel/hardware queue.
The submit time, queued time, and latency are also available in the CUDA Launches report page.
COLLECTING THE DATA
In the Analysis Activity under Trace Settings
- Enable CUDA
- Expand CUDA and enable Kernel Launches and Memory Options and Driver Queue Latency.
This feature adds ~2µs each time work is submitted from the software queue to the kernel/hardware queue, so it is only advisable to enable it when you are debugging WDDM queuing issues.
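To get a feel for that cost, here is a back-of-the-envelope sketch. It assumes the ~2µs figure applies uniformly to every submission; the submission count is a made-up input, and `added_overhead_ms` is a hypothetical helper name.

```python
PER_SUBMIT_OVERHEAD_US = 2.0  # approximate per-submission cost quoted above


def added_overhead_ms(submissions):
    """Estimated extra time, in milliseconds, for a given number of submissions."""
    return submissions * PER_SUBMIT_OVERHEAD_US / 1000.0


# Hypothetical workload: 50,000 submissions to the kernel/hardware queue.
print(added_overhead_ms(50_000))
```

Note that one submission can carry several CUDA commands, so the number of submissions may be lower than the number of kernel launches.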