The asyncAPI example attempts to demonstrate that the CPU and GPU can run in parallel. Running this example on my machine (details below) shows that the CPU is not running in parallel with the GPU. To be clear, “not running in parallel” means that the reported “time spent executing by the GPU” and “time spent by CPU in CUDA calls” are roughly equal.
Can anyone run this sample and get results where the CPU spends much less time in CUDA calls than the GPU spends executing?
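For anyone not familiar with the sample, its core is roughly the following (a from-memory sketch, not the actual SDK source, which uses its own timer helpers): all the GPU work is issued asynchronously on stream 0, and the CPU then spins counting iterations until an event recorded after that work completes.

#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Rough stand-in for the sample's increment kernel.
__global__ void incrementKernel(int *data, int inc)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += inc;
}

int main()
{
    const int n = 16 * 1024 * 1024;
    const size_t bytes = n * sizeof(int);

    int *h_data = nullptr, *d_data = nullptr;
    cudaMallocHost(&h_data, bytes);   // pinned host memory so the copies can be async
    cudaMalloc(&d_data, bytes);
    memset(h_data, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Issue all GPU work asynchronously on stream 0, bracketed by events,
    // and time how long the CPU is stuck inside these calls.
    auto cpuStart = std::chrono::steady_clock::now();
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, 0);
    incrementKernel<<<n / 512, 512>>>(d_data, 1);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    auto cpuEnd = std::chrono::steady_clock::now();

    // The CPU is now (supposedly) free; count how often it can spin
    // before the GPU signals completion.
    unsigned long counter = 0;
    while (cudaEventQuery(stop) == cudaErrorNotReady)
        ++counter;

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);
    double cpuMs = std::chrono::duration<double, std::milli>(cpuEnd - cpuStart).count();

    printf("time spent executing by the GPU: %.2f ms\n", gpuMs);
    printf("time spent by CPU in CUDA calls: %.2f ms\n", cpuMs);
    printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}

If launches really are asynchronous, the “time spent by CPU in CUDA calls” should be tiny compared with the GPU time; on my machine the two come out roughly equal.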
My OS is Windows 7 64-bit, CUDA toolkit 3.1, driver 257.21. I have both a GT240 card and integrated graphics; the non-parallel behavior is the same with either GPU.
I do not have the synchronous kernel launch macro defined, and this is not a problem with the timing method.
Attached are the results from running asyncAPI, followed by the output of the “deviceQuery” sample.
Any advice would be greatly appreciated! This is killing performance…
[asyncAPI]
CUDA device [GeForce GTX 275]
time spent executing by the GPU: 59.54
time spent by CPU in CUDA calls: 0.04
CPU executed 177598 iterations while waiting for GPU to finish
--------------------------------------------------------------
[asyncAPI] -> Test Results:
PASSED
Press ENTER to exit...
and a Mac OS X 10.6 laptop with CUDA 3.2:
[asyncAPI]
CUDA device [GeForce 320M]
time spent executing by the GPU: 131.51
time spent by CPU in CUDA calls: 0.19
CPU executed 213044 iterations while waiting for GPU to finish
--------------------------------------------------------------
[asyncAPI] -> Test Results:
PASSED
Press ENTER to exit...
I think you are a victim of WDDM and the command queue batching that the CUDA driver has to do to work around all the nonsense that WDDM imposes.
THANK YOU for posting this! I was starting to go crazy…
I think I’m a victim of poor developer support generally, and poor documentation specifically. The WDDM might make it hard to write efficient CUDA drivers, but it doesn’t make it hard to write accurate CUDA documentation.
But you’re right in that the issue here is the CUDA developers’ response to high kernel launch overhead under WDDM. The issue is discussed in several other threads, but for the benefit of anyone unfortunate enough to follow in my footsteps:
Kernel launches and other GPU commands are sent to the GPU immediately on Linux, OS X, and Windows XP. On Vista and Windows 7, they get put into a queue until the driver (or CUDA runtime) decides to flush that queue to the GPU. So when the docs say “kernel launches are asynchronous,” they forget to add “except on recent versions of the niche OS called MS Windows, in which case we have a fun little undocumented surprise for you!” One workaround is to create an event, record it, and then query it immediately after the kernel launch; the query forces the internal queue to be flushed.
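For concreteness, here is a minimal sketch of that workaround (my own illustration, not code from the SDK; the kernel and sizes are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel -- stands in for whatever work you actually launch.
__global__ void busyKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] + 1;
}

int main()
{
    const int n = 1 << 20;
    int *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    // Event used only to force the WDDM command queue to flush.
    cudaEvent_t flushEvent;
    cudaEventCreate(&flushEvent);

    busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Record and immediately query the event. The query makes the driver
    // submit the queued launch to the GPU right away instead of batching it.
    cudaEventRecord(flushEvent, 0);
    cudaEventQuery(flushEvent);

    // ... CPU work can now genuinely overlap with the kernel ...

    cudaDeviceSynchronize();   // or cudaThreadSynchronize() on CUDA 3.x
    cudaEventDestroy(flushEvent);
    cudaFree(d_data);
    return 0;
}

The cudaEventQuery call returns immediately, but it pushes whatever is sitting in the queue out to the GPU, so the kernel actually starts running while the CPU goes on with its own work.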
To my knowledge, as of CUDA 3.2, this behavior is not documented anywhere but the forums. The docs are flat-out wrong.