Losing 800 us to PCIe latency per kernel launch
Looking for tweaks and optimizations to minimize PCIe

I have been using the NVIDIA Compute Visual Profiler to monitor and measure CUDA calls and timing. I am losing about 800 us after each cuLaunchGrid call. I have looked at my own code and at many of the SDK examples, and it is always the same: there is about 800 us of idle or lost time after kernel launches, whereas there is only 5 to 20 us of idle or lost time after memory transfers. Can anyone say for sure what the root cause of this lost time is? More importantly, can anyone suggest ways to tweak my system to minimize it? It puzzles me that kernel launches lose so much more time than memory transfers. Looking at the timing charts in the Compute Visual Profiler, the lost time appears to come after the kernel completes, not as setup time for preparing the launch.

I am using CUDA toolkit 3.2.16, video driver 260.93, Windows 7 Professional 64-bit, and I have both a GeForce GTX 580 and a Tesla C2050 (set to TCC mode) for CUDA processing, while using a GeForce 9800 GT for video. System hardware: Intel i7-960 (3.2 GHz), 12.0 GB RAM, EVGA X58 Classified 4-Way SLI.

I have a number of kernels that I am trying to launch in series, for example:
err = cuMemcpyHtoD((CUdeviceptr)cuda_in_ptr, host_in_ptr, in_block_size);
err = cuLaunchGrid(A_Kernel, 150, 1);
err = cuLaunchGrid(B_Kernel, 120, 1);
err = cuLaunchGrid(C_Kernel, 190, 1);
err = cuMemcpyDtoH(host_out_ptr, (CUdeviceptr)cuda_out_ptr, out_block_size);

With the Compute Visual Profiler, I am observing the time each of the above steps takes, and I am seeing large blocks of idle time between the cuLaunchGrid calls.

For example:
Function | Duration (us) | Idle Time (us)
cuMemcpyHtoD | 1.2 | 8.29
cuLaunchGrid(A) | 3.7 | 789.74
cuLaunchGrid(B) | 11711.2 | 780.65
cuLaunchGrid(C) | 41.9 | 782.82
cuMemcpyDtoH | 6.4 | 21.01

I observe almost identical results with both the GeForce GTX 580 and the Tesla C2050 (except that on the GTX 580 the duration of cuLaunchGrid(B) is only 8683.1 us).

Waiting 770 to 820 us of idle time between kernel launches seems like a long time - it is about 20% of the above timeline.

I also tried cuLaunchGridAsync(…) into stream 0 and got the same sort of performance. I read in another thread that calling cuStreamQuery(0); after the launch can prevent the driver from batching kernel launches, but I got the same results with that as well.

I also tried a number of the SDK samples and observed the same 800 us delays. In one example, each kernel took about 800 us to execute and was followed by about 800 us of idle time before the next launch, so a full 50% was lost to idle time.
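As a quick sanity check on the percentages quoted above, here is a minimal C sketch that sums the Duration and Idle Time columns of a profiler timeline and reports the idle share. The struct, function, and array names (timeline_row, idle_fraction, my_timeline, sdk_sample) are made up for illustration and are not part of CUDA or the profiler. The numbers are copied from the table above and from the SDK sample case, where the 800 us figures are approximate.

```c
#include <assert.h>

/* One row of the profiler timeline: time spent in the call and the
   idle gap that follows it, both in microseconds. */
struct timeline_row {
    const char *name;
    double duration_us;
    double idle_us;
};

/* Values copied from the Tesla C2050 timeline in the table above. */
const struct timeline_row my_timeline[] = {
    { "cuMemcpyHtoD",        1.2,   8.29 },
    { "cuLaunchGrid(A)",     3.7, 789.74 },
    { "cuLaunchGrid(B)", 11711.2, 780.65 },
    { "cuLaunchGrid(C)",    41.9, 782.82 },
    { "cuMemcpyDtoH",        6.4,  21.01 },
};

/* The SDK sample case: about 800 us of kernel time, about 800 us idle. */
const struct timeline_row sdk_sample[] = {
    { "kernel", 800.0, 800.0 },
};

/* Fraction of the whole timeline (busy + idle) that is idle time. */
double idle_fraction(const struct timeline_row *rows, int n)
{
    double busy = 0.0, idle = 0.0;
    int i;

    assert(rows != 0 && n > 0);
    for (i = 0; i < n; ++i) {
        busy += rows[i].duration_us;
        idle += rows[i].idle_us;
    }
    return idle / (busy + idle);
}
```

Calling idle_fraction(my_timeline, 5) gives roughly 0.17, in the same ballpark as the 20% estimate, and idle_fraction(sdk_sample, 1) gives 0.5. The share depends heavily on kernel B: the shorter the kernels, the more a fixed gap of about 800 us dominates the timeline.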

My questions are: what is the source of these delays (the Windows 7 Professional OS, PCIe latency, or something else), and, more importantly, what can I tweak to minimize them?

I’m seeing the same behaviour… Did you figure out if this is caused by the Visual Profiler’s statistics gathering, or if this large latency is also present during normal CUDA execution?