"idle time" between kernel calls ( from NVVP inspection)

Hi,

I have an application that runs 15-20 short-lived kernels in a row (each with less than 1 ms of running time). Looking at the visual profiler output, the serial timeline looks something like this:

H2D transfer
[idle time: ~10 us]
kernel_1
[idle time: ~10 us]
kernel_2
[idle time: ~10 us]
...
kernel_X
[idle time: ~2500 us]
kernel_Y
[idle time: ~10 us]
...
kernel_20

I’m trying to understand why, between two particular kernels, there always appears to be a 2-2.5 ms “idle time” during which nothing is happening (no memory transfers and no kernel executing).

I can live with the occasional 0.5 ms idle time, but not a consistent 2.5 ms that always happens at the same spot.

Executing the whole set of kernels takes less than 20 ms, so it shouldn’t be due to WDDM batching, right?

Thanks for your input!

In my answer to cuda - Difference in time reported by NVVP and counters - Stack Overflow, I detail how time is calculated in the current profilers.

An empty space of < 10 µs between launches is likely due to kernel launch configuration setup; the GPU is likely not idle during this time. As noted in the Stack Overflow post, the CUDA profilers do not include launch configuration overhead in the reported duration of the kernel.
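For illustration, here is a minimal sketch (the kernel and sizes are hypothetical) of bracketing a launch with CUDA events in the same stream. The event-measured interval spans everything the GPU does between the two events, so it will typically include the setup time that the profiler shows as a small gap rather than as part of the kernel's duration:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical short-lived kernel standing in for kernel_1 .. kernel_20.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 16;
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events are recorded in the same (default) stream as the launch, so the
    // measured interval covers launch configuration setup plus kernel execution.
    cudaEventRecord(start, 0);
    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("event-measured launch + execution: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}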

WDDM drivers have additional launch overhead on the CPU side due to queuing in the CUDA driver and various latencies in the WDDM kernel mode driver. An empty space on the timeline of several milliseconds usually indicates stalling due to queuing in the CUDA driver (only on WDDM) or the presence of another context (probably graphics) executing on the GPU.
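If the stall turns out to be WDDM command batching in the CUDA driver, one commonly used mitigation is to nudge the driver into flushing its software queue after a batch of launches, for example with cudaStreamQuery(). A minimal sketch under that assumption (the kernel and buffer names are hypothetical):

#include <cuda_runtime.h>

// Hypothetical stand-in for the application's short-lived kernels.
__global__ void shortKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void launchBatch(float *d_buf, int n, cudaStream_t stream)
{
    const int block = 256;
    const int grid  = (n + block - 1) / block;

    for (int k = 0; k < 20; ++k)
        shortKernel<<<grid, block, 0, stream>>>(d_buf, n);

    // On WDDM, launches may sit in the CUDA driver's software queue until the
    // driver decides to submit the command buffer. Querying the stream is a
    // lightweight, non-blocking way to push the queued work out to the GPU.
    cudaStreamQuery(stream);
}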

Nsight Visual Studio Edition 3.0 RC has two new features in the Analysis Trace Activity to help identify the cause of the latency:

  1. Windows Display Driver Model Trace

This captures events in the WDDM driver. These events include

  • queuing and execution of command buffers
  • memory copy to/from device memory
  • paging of memory from host to device memory

This information is captured for all processes in the system. A command buffer can contain multiple CUDA commands including kernel launches, memory copies, and memory sets.

This information is displayed in the Timeline report page under the node System\GPU Usage and System\GPU Stats.

In the Analysis Activity under Trace Settings

  • Enable System
  • Expand System and enable Windows Display Driver Model Trace\WDDM Base Events

  2. CUDA Driver Queue Latency

This feature tracks the depth of the CUDA driver queue (WDDM only) and the kernel/hardware queue.

VIEWING THE DATA
This information is displayed as a graph in the Timeline report page under the CUDA\Context #\Counters node. The positive y-axis of the graph is the depth of the CUDA driver queue. The negative y-axis of the graph is the depth of the kernel/hardware queue.

The submit time, queued time, and latency are also available in the CUDA Launches report page.

COLLECTING THE DATA
In the Analysis Activity under Trace Settings

  • Enable CUDA
  • Expand CUDA and enable Kernel Launches and Memory Options, and Driver Queue Latency.

This feature adds ~2 µs each time work is submitted from the software queue to the kernel/hardware queue, so it is advisable to enable it only when you are debugging WDDM queuing issues.

Greg, thank you very much for your detailed answer.

I am doing a graphics update on a separate thread which reads an output buffer produced by the “worker thread” detailed above.

Lowering the number of updates that the display thread performs seemed to mitigate the problem a bit; however, I’m still running into 1.5-3 ms stalls, both in the temporal vicinity of a graphics update and when there should be nothing else going on.

You can see my images below to get an idea:


I’ve performed an analysis using Nsight 3.0 RC.

  • Every 10 ms there is WDDM update activity.
  • This coincides with my NVVP results, where the temporal distance between each 1.5-3 ms gap is 10 ms.
  • My OpenGL graphics updates happen every 30 ms, and if I remove the visualization I still see the above-mentioned gap in the same pattern.

Could someone with more knowledge advise me?

Thanks so much,

OK, I discovered that there was a tiny 4-byte memcpy H2D at the end of the 1.5-3 ms gap. The memcpy was so small that I did not manage to spot it in NVVP among all the other kernels.

The memcpy was synchronous and hence forced the whole kernel queue to wait.
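For anyone hitting the same issue, a hedged sketch of one possible fix (the variable names are hypothetical, not from my actual code): keep the 4-byte host value in pinned memory and enqueue the copy with cudaMemcpyAsync in the same stream as the kernels, so it no longer blocks the host thread between launches.

#include <cuda_runtime.h>

// Hypothetical 4-byte parameter that was being copied H2D between kernels.
int *h_param = nullptr;   // pinned host copy
int *d_param = nullptr;   // device copy

void setupParam()
{
    // Pinned (page-locked) host memory is required for the copy to be
    // truly asynchronous with respect to the host thread.
    cudaHostAlloc(&h_param, sizeof(int), cudaHostAllocDefault);
    cudaMalloc(&d_param, sizeof(int));
}

void updateParam(int value, cudaStream_t stream)
{
    *h_param = value;
    // Enqueued in the same stream as the kernels, so the copy is ordered with
    // the launches but does not stall the host the way the synchronous copy did.
    cudaMemcpyAsync(d_param, h_param, sizeof(int), cudaMemcpyHostToDevice, stream);
}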

EDIT: Btw when in doubt as to what the problem is, always question yourself ;)