Hi, is it possible in any way to pipeline data transfer to/from the card and the execution of a kernel, even if it requires separate CUDA contexts? I know that within a single context, memory transfer operations block, so it's not possible in a single-threaded CUDA app.
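For what it's worth, newer CUDA releases expose streams plus `cudaMemcpyAsync` (on pinned host memory), which allow exactly this overlap within a single context. A minimal sketch of the pattern, assuming a hypothetical `process` kernel and illustrative sizes:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; the name and the doubling are assumptions.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20, CHUNK = N / 4;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float)); // pinned memory: required for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Alternate chunks between two streams so one chunk's copy can
    // overlap the other chunk's kernel execution.
    for (int k = 0; k < N / CHUNK; ++k) {
        cudaStream_t st = s[k % 2];
        cudaMemcpyAsync(d + k * CHUNK, h + k * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, st);
        process<<<(CHUNK + 255) / 256, 256, 0, st>>>(d + k * CHUNK, CHUNK);
        cudaMemcpyAsync(h + k * CHUNK, d + k * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Whether copy and compute actually overlap depends on the device having an async copy engine.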
After reading through my post from yesterday again, it seems I was quite vague. We are measuring 11 launches of an empty kernel and throwing out the measurement of the first launch, since the first launch takes about 10 times longer than the subsequent ones. Averaging over launches #2-11, we found that an empty kernel takes 0.07ms to launch.
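To be concrete, the measurement loop looks roughly like this (the kernel name is illustrative, and CUDA events are used here in place of whatever host timer we actually used):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void empty_kernel(void) {}

int main(void) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty_kernel<<<1, 1>>>();        // launch #1: warm-up, discarded
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i)     // launches #2-11
        empty_kernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg launch time: %f ms\n", ms / 10.0f);
    return 0;
}
```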
I also noticed during further testing that if I run the test directly after cold-booting the workstation, the numbers drop to 0.015ms/launch. I’m not quite sure why the times are faster directly after a cold boot of the (SUSE Linux 10.2) workstation; I’m assuming it is either a hardware quirk or a driver bug…
An empty function call in CPU code takes less time than the resolution of the system timer (0.001ms), so launching a CUDA kernel takes more than 15 times as long as a CPU function call even directly after a cold boot.