Trying to reduce delays between kernel launches

I am using CUDA toolkit 3.2.16 and video driver 260.93 on Windows 7 64-bit, with a GeForce GTX 580 and a Tesla C2050 (set to TCC mode).

I have a number of kernels that I am trying to launch in series, for example:
err = cuMemcpyHtoD((CUdeviceptr)cuda_in_ptr, host_in_ptr, in_block_size);
err = cuLaunchGrid(A_Kernel, 150, 1);
err = cuLaunchGrid(B_Kernel, 120, 1);
err = cuLaunchGrid(C_Kernel, 190, 1);
err = cuMemcpyDtoH(host_out_ptr, (CUdeviceptr)cuda_out_ptr, out_block_size);

Using Compute Visual Profiler, I measured how long each of the above steps takes, and I am seeing large blocks of idle time between the cuLaunchGrid calls.

For example:
Function          Duration (us)   Idle time (us)
cuMemcpyHtoD                1.2             8.29
cuLaunchGrid(A)             3.7           789.74
cuLaunchGrid(B)         11711.2           780.65
cuLaunchGrid(C)            41.9           782.82
cuMemcpyDtoH                6.4            21.01

I observe almost identical results on both the GeForce GTX 580 and the Tesla C2050 (except that on the GTX 580 the duration of cuLaunchGrid(B) is only 8683.1 us).

An idle gap of 770 to 820 us between kernel launches seems like a long time: it amounts to about 20% of the above timeline.
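
As a rough check on that share, here is a minimal sketch in C that sums the two columns of the profiler table. It assumes the timeline is simply the sum of the Duration and Idle Time columns, which may not match exactly how the profiler lays out the trace. The helper name idle_fraction is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the profiler table above, in microseconds. */
const double busy_us[] = {1.2, 3.7, 11711.2, 41.9, 6.4};
const double idle_us[] = {8.29, 789.74, 780.65, 782.82, 21.01};

/* Hypothetical helper: fraction of the timeline spent idle,
 * assuming timeline = sum(duration) + sum(idle time). */
double idle_fraction(const double *busy, const double *idle, size_t n)
{
    double busy_sum = 0.0;
    double idle_sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        busy_sum += busy[i];
        idle_sum += idle[i];
    }
    return idle_sum / (busy_sum + idle_sum);
}
```

With the values listed in the table, idle_fraction(busy_us, idle_us, 5) comes out at roughly 0.168, and the three gaps that follow the kernel launches account for nearly all of the idle time.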

I also tried cuLaunchGridAsync(…) into stream 0 and got the same sort of performance. Another discussion thread suggested calling cuStreamQuery(0) to avoid batching of the kernel launches; I tried that as well and got the same results.

My questions: what is the source of the above delays (the Windows 7 OS?), and how can I minimize them?