asyncApi example -- does it work for anyone?

Hi all,

The asyncApi example attempts to demonstrate that the CPU and GPU can run in parallel. Running this example on my machine (details below) shows that the CPU is not running in parallel with the GPU. Just to be clear, “not running in parallel” means that the reported times for “time spent executing by the GPU” and “time spent by CPU in CUDA calls” are roughly equivalent.
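
For reference, the two reported numbers come from a pattern roughly like the following. This is a simplified sketch, not the exact SDK source: increment_kernel, the sizes, and the clock()-based host timer are stand-ins (the SDK uses its own cutil timer), but it shows the idea of timing the GPU with events while a host timer wraps the supposedly asynchronous calls.

#include <cstdio>
#include <cstring>
#include <ctime>
#include <cuda_runtime.h>

__global__ void increment_kernel(int *g_data, int inc_value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] += inc_value;
}

int main()
{
    const int n = 16 * 1024 * 1024;
    const int nbytes = n * sizeof(int);

    int *h_a = 0;
    int *d_a = 0;
    cudaMallocHost((void **)&h_a, nbytes);   // pinned memory, needed for genuinely async copies
    memset(h_a, 0, nbytes);
    cudaMalloc((void **)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 threads(512);
    dim3 blocks(n / threads.x);

    clock_t cpu_start = clock();             // crude host timer, stands in for the SDK's cutil timer
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, h_a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads>>>(d_a, 1);
    cudaMemcpyAsync(h_a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    double cpu_ms = 1000.0 * (clock() - cpu_start) / CLOCKS_PER_SEC;  // should be ~0 if the calls are truly async

    // If the calls above really are asynchronous, the CPU is free here while
    // the GPU works; the sample just counts iterations until the GPU is done.
    unsigned long counter = 0;
    while (cudaEventQuery(stop) == cudaErrorNotReady)
        ++counter;

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("time spent executing by the GPU: %.2f\n", gpu_ms);
    printf("time spent by CPU in CUDA calls: %.2f\n", cpu_ms);
    printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}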

Can anyone run this sample and get results where the cpu spends much less time in cuda calls than the gpu spends executing?

My OS is Win 7 64 bit, CUDA toolkit 3.1, driver 257.21. I have both a GT240 card and integrated graphics. The non-parallel behavior is the same with either gpu.

I do not have the synchronous kernel launch macro defined, so this is not a problem with the timing method.

Attached are the results when running asyncApi, and then the output of the “deviceQuery” sample.

Any advice would be greatly appreciated! This is killing performance…

asyncApiResults.txt (1.85 KB)

This is a run on a Linux box with CUDA 3.2:

[asyncAPI]

CUDA device [GeForce GTX 275]

time spent executing by the GPU: 59.54

time spent by CPU in CUDA calls: 0.04

CPU executed 177598 iterations while waiting for GPU to finish

--------------------------------------------------------------

[asyncAPI] -> Test Results:

PASSED

Press ENTER to exit...

and a Mac OS X 10.6 laptop with CUDA 3.2:

[asyncAPI]

CUDA device [GeForce 320M]

time spent executing by the GPU: 131.51

time spent by CPU in CUDA calls: 0.19

CPU executed 213044 iterations while waiting for GPU to finish

--------------------------------------------------------------

[asyncAPI] -> Test Results:

PASSED

Press ENTER to exit...

I think you are a victim of WDDM and the command queue batching that the CUDA driver has to do to work around all the nonsense that WDDM imposes.

THANK YOU for posting this! I was starting to go crazy…

I think I’m a victim of poor developer support generally, and poor documentation specifically. The WDDM might make it hard to write efficient CUDA drivers, but it doesn’t make it hard to write accurate CUDA documentation.

But you’re right in that the issue here is the CUDA developers’ response to high kernel launch overhead under WDDM. The issue is discussed in several other threads, but for the benefit of anyone unfortunate enough to follow in my footsteps:

Kernel launches and other GPU commands are sent to the GPU immediately on Linux, OS X, and Windows XP. On Vista and 7, they are placed into a queue until the driver (CUDA runtime) decides to flush the queue to the GPU. So when the docs say “kernel launches are asynchronous”, they’re forgetting to add “except on recent versions of the niche OS called MS Windows, in which case we have a fun little undocumented surprise for you!” One solution is to create an event, then record and query it just after the kernel launch. This causes the internal queue to be flushed.
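
Here’s a minimal sketch of that workaround, assuming the CUDA runtime API; my_kernel and do_cpu_work are placeholders of mine, not anything from the SDK sample:

#include <cuda_runtime.h>

__global__ void my_kernel(int *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1;
}

static void do_cpu_work(void)
{
    /* placeholder for useful host-side work that should overlap with the GPU */
}

int main()
{
    const int n = 1 << 20;
    int *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    cudaEvent_t kickoff;
    cudaEventCreate(&kickoff);

    my_kernel<<<n / 256, 256>>>(d_data);  // queued, possibly not yet submitted under WDDM
    cudaEventRecord(kickoff, 0);
    cudaEventQuery(kickoff);              // the flush: pushes the queued commands to the GPU

    do_cpu_work();                        // now genuinely overlaps with the kernel

    cudaEventSynchronize(kickoff);        // wait for the GPU once the CPU work is done

    cudaEventDestroy(kickoff);
    cudaFree(d_data);
    return 0;
}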

To my knowledge, as of CUDA 3.2, this behavior is not documented anywhere but the forums. The docs are flat-out wrong.

Yes, I can. With Windows 7 x64, GTX260, Driver 260.93, CUDA 2.3:

time spent executing by the GPU: 29.44

time spent by CPU in CUDA calls: 0.04

CPU executed 19507 iterations while waiting for GPU to finish


Test PASSED

On the other hand, here’s the output from the CUDA 3.2 AsyncApi sample, on the same system:

[asyncAPI]

CUDA device [GeForce GTX 260]

time spent executing by the GPU: 29.35

time spent by CPU in CUDA calls: 31.86

CPU executed 18134 iterations while waiting for GPU to finish


[asyncAPI] -> Test Results:

PASSED

I haven’t looked into this further, but I’m confident that GPU computation can overlap with the CPU - my application depends on it.

I can only suggest that you take a close look at the AsyncAPI sample in 3.2 to see what’s going on, and perhaps compare with the 2.3 version.

Good luck!

Alistair