The asyncApi example is meant to demonstrate that the CPU and GPU can run in parallel. Running this example on my machine (details below) shows that the CPU is not running in parallel with the GPU. To be clear, “not running in parallel” means that the reported “time spent executing by the GPU” and “time spent by CPU in CUDA calls” are roughly equal.
Can anyone run this sample and get results where the CPU spends much less time in CUDA calls than the GPU spends executing?
My OS is Windows 7 64-bit, with CUDA Toolkit 3.1 and driver 257.21. I have both a GT 240 card and integrated graphics; the non-parallel behavior is the same with either GPU.
I do not have the synchronous kernel launch macro defined, so this is not an artifact of the timing method.
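For anyone who wants to reproduce this without the SDK handy, here is roughly the pattern the sample uses (a minimal sketch from memory; the kernel name, buffer size, and launch configuration here are my assumptions and may differ from the actual SDK source). The key point is that all GPU work is issued asynchronously on stream 0, and the CPU then counts iterations while polling the stop event:

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void increment_kernel(int *g_data, int inc_value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = g_data[idx] + inc_value;
}

int main()
{
    const int n = 16 * 1024 * 1024;
    const int nbytes = n * sizeof(int);

    int *a = 0;
    cudaMallocHost((void **)&a, nbytes);  // pinned host memory, required for async copies
    memset(a, 0, nbytes);

    int *d_a = 0;
    cudaMalloc((void **)&d_a, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threads(512);
    dim3 blocks(n / threads.x);

    // Queue all GPU work asynchronously on stream 0.
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads>>>(d_a, 26);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);

    // If launches are truly asynchronous, this loop spins on the CPU
    // while the GPU is still working, so the counter should be large.
    unsigned long counter = 0;
    while (cudaEventQuery(stop) == cudaErrorNotReady)
        ++counter;

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    printf("time spent executing by the GPU: %.2f ms\n", gpu_ms);
    printf("CPU polled the stop event %lu times while waiting\n", counter);

    cudaFreeHost(a);
    cudaFree(d_a);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

On my machine the CPU-side time measured around these calls comes out roughly equal to the GPU time, as if the launches were blocking.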
Attached are the results when running asyncApi, and then the output of the “deviceQuery” sample.
Any advice would be greatly appreciated! This is killing performance…
asyncApiResults.txt (1.85 KB)