Losing 800us to PCIe latency per Kernel launch
Looking for tweaks and optimizations to minimize PCIe

Craig_481 · January 7, 2011, 3:06pm

I have been using the NVIDIA Compute Visual Profiler to monitor and measure CUDA calls and timing. I am losing about 800 us after each cuLaunchGrid call. I have looked at my own code and many of the SDK examples, and it is always the same: there is about 800 us of idle or lost time after kernel launches, whereas there is only 5 to 20 us of idle or lost time after memory transfers. Can anyone say for sure what the root cause of this lost time is? More importantly, can anyone suggest ways to tweak my system to minimize it? It puzzles me that kernel launches lose so much more time than memory transfers. Looking at the timing charts in the Compute Visual Profiler, the lost time appears to occur after the kernel completes, not as setup time before the launch.

I am using CUDA toolkit 3.2.16, video driver 260.93, Windows 7 Professional 64-bit, and I have both a GeForce GTX 580 and a Tesla C2050 (set to TCC mode) for CUDA processing, while using a GeForce 9800 GT for video. System hardware: Intel i7-960 (3.2 GHz), 12.0 GB RAM, EVGA X58 Classified 4-Way SLI.

I have a number of kernels that I am trying to launch in series, for example:
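// Copy the input block to the device
err = cuMemcpyHtoD((CUdeviceptr)cuda_in_ptr, host_in_ptr, in_block_size);
// Launch the three kernels back to back (arguments: function, grid width, grid height)
err = cuLaunchGrid(A_Kernel, 150, 1);
err = cuLaunchGrid(B_Kernel, 120, 1);
err = cuLaunchGrid(C_Kernel, 190, 1);
// Copy the result block back to the host
err = cuMemcpyDtoH(host_out_ptr, (CUdeviceptr)cuda_out_ptr, out_block_size);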

With Compute Visual Profiler, I am observing the time it takes to do each of the above steps, and I am seeing large blocks of idle time between the cuLaunchGrid calls.

For example:
Function           Duration (us)   Idle Time (us)
cuMemcpyHtoD                 1.2             8.29
cuLaunchGrid(A)              3.7           789.74
cuLaunchGrid(B)          11711.2           780.65
cuLaunchGrid(C)             41.9           782.82
cuMemcpyDtoH                 6.4            21.01

I observe almost identical results with both the GeForce GTX 580 and the Tesla C2050 (except that on the GTX 580 the duration of cuLaunchGrid(B) is only 8683.1 us).

Idle gaps of 770 to 820 us between kernel launches seem like a long time: with the figures above, idle time accounts for roughly 17% of the timeline.

I also tried cuLaunchGridAsync(…) into stream 0 and got the same sort of performance. In another discussion thread I read a suggestion to call cuStreamQuery(0); after the launch to avoid batching of the kernel launches; I tried that and got the same results.

I also tried a number of the SDK samples and observed the same 800 us delays. In one example, each kernel took about 800 us to execute, followed by about 800 us of idle time before the next launch, so a full 50% was lost to idle time.

My questions: what is the source of the above delays (the Windows 7 Professional OS, PCIe latency, or something else), and, more importantly, are there any suggestions on things to tweak to minimize these delays?

lars · March 23, 2011, 1:55am

I’m seeing the same behaviour… Did you figure out if this is caused by the Visual Profiler’s statistics gathering, or if this large latency is also present during normal CUDA execution?

/L
