IO and Execution Pipelining

Hi, is it possible in any way to pipeline data transfer to/from the card with the execution of a kernel, even if it requires separate CUDA contexts? I know that within a single context, memory transfer operations block, so it's not possible in a single-threaded CUDA app.


This is not possible in CUDA 1.0 because G8X GPUs are not designed for this, but it will be supported in future hardware and a future release of CUDA.


Let me ask a question along these same lines:

We have been looking into running short “bursty” programs on the GPU and the startup time for a CUDA program is killing us.

Can the hardware support “pinning” a program into GPU global memory so that it can be started faster? I know this is not possible with the current release of CUDA, but can the hardware do it?

Is the long startup time now a limitation of hardware or the kernel driver?

Thanks, Mark. Looking forward to it. It will be a massive performance improvement for IO-heavy GPGPU tasks!


More details would be helpful here. From your perspective, what is causing the long startup time? Is this the time from when a launch is requested to when it begins executing?

(Note, the GPU code is loaded when a module is loaded, which corresponds to cuModuleLoad in the driver API or loading of a link target including a .cu file if using CUDART.)

By startup time I mean the amount of time it takes between the host C code calling a CUDA function and when CUDA returns to the host C code.

We are measuring a startup time of 0.7 milliseconds, when the amount of work being done by the program is often 0.5 milliseconds and can be done on the CPU in 1 millisecond.
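To make the trade-off concrete, here is a minimal sketch of the break-even arithmetic. It assumes the 0.5 ms figure is the GPU execution time and that the launch overhead simply adds to it. The function name `gpu_is_faster` is made up for illustration and is not part of CUDA.

```cpp
#include <cassert>

// Hypothetical helper: true if offloading beats running on the CPU.
// Assumption: total GPU cost = fixed launch overhead + kernel execution time.
bool gpu_is_faster(double overhead_ms, double gpu_work_ms, double cpu_ms) {
    assert(overhead_ms >= 0.0 && gpu_work_ms >= 0.0 && cpu_ms >= 0.0);
    return overhead_ms + gpu_work_ms < cpu_ms;
}
```

With the figures above, `gpu_is_faster(0.7, 0.5, 1.0)` is false: 1.2 ms on the GPU versus 1 ms on the CPU.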

We also confirmed, per other postings on the forum, that the first program launch takes significantly longer than the following launches.

A few questions:

  1. do you have the profiler enabled?

  2. is the 0.7ms startup time for a single launch, or a sum over several launches?

  2a. if you are timing multiple launches, does anything (such as grid configuration) change from launch to launch?

  3. are you using run-time or driver API?

  4. is the kernel in question using textures?


After reading through my post from yesterday again, it seems that I was quite vague. We are measuring 11 launches of an empty kernel and throwing out the measurement of the first launch. We record launches #2-11 to make up for the fact that the first launch takes 10 times longer than the subsequent ones. Averaging over those 10 launches, we found that an empty kernel takes 0.07ms to launch.
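For anyone who wants to reproduce this, the procedure described above (one discarded warm-up launch, then an average over the remaining ones) could be sketched roughly as follows. This is host-side C++ only. `mean_launch_ms` is a hypothetical helper, and the callable is a stand-in for whatever the host code actually calls to launch the kernel, so this is not the exact test program used here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: call `launch` once as an untimed warm-up, then time
// `runs` further calls and return the mean wall-clock time per call in ms.
double mean_launch_ms(const std::function<void()>& launch, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;

    launch();  // warm-up call (launch #1), deliberately not timed

    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        launch();  // launches #2 .. #(runs + 1)
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / runs;
}
```

Calling it with `runs = 10` gives 11 launches in total, matching the setup above. Timing the whole loop rather than each launch keeps the timer's own resolution from dominating the result.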

I also noticed during further testing that if I run the test directly after cold-booting the workstation, the numbers drop to 0.015ms/launch. I’m not quite sure why the times are faster if I cold-boot the (SUSE Linux 10.2) workstation directly before running the test. I’m assuming it is either a hardware quirk or a driver bug…

An empty function call in CPU code takes less time than the resolution of the system timer (0.001ms), so calling a CUDA function takes more than 15 times as long as calling a CPU function, even directly after a cold boot.
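Since a single empty call is below the timer resolution, one way to get a usable number for the CPU side is to time a large batch of calls and divide. A minimal sketch, assuming `std::chrono::steady_clock` as the timer (`empty_call` and `mean_call_ms` are illustrative names, and the iteration count is arbitrary):

```cpp
#include <cassert>
#include <chrono>

// Stand-in for "an empty function call in CPU code".
void empty_call() {}

// Mean cost of one call in milliseconds, averaged over `iterations` calls.
double mean_call_ms(void (*fn)(), long iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;

    // Volatile pointer so the compiler cannot inline or drop the calls.
    void (*volatile call)() = fn;

    const auto start = clock::now();
    for (long i = 0; i < iterations; ++i) {
        call();
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}
```

Something like `mean_call_ms(empty_call, 1000000)` then gives a per-call figure that can be compared against the per-launch average, although the loop overhead is included in it.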