Hi all… quite frustrated by this one.
I do something like:
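Roughly the following. This is a minimal sketch of the pattern rather than the exact code: the kernel (`busyKernel`), its launch configuration, the body of `useCpu()`, and the `std::chrono` timer are all placeholder assumptions. Only the order of the three timed statements (kernel launch, `useCpu()`, `cudaThreadSynchronize()`) comes from the setup described here.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one: a loop that keeps the GPU busy.
__global__ void busyKernel(float *data, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Hypothetical stand-in for useCpu(): host-only work that never touches the GPU.
static double useCpu()
{
    volatile double acc = 0.0;
    for (int k = 0; k < 5000000; ++k)
        acc = acc + k * 1e-9;
    return acc;
}

typedef std::chrono::steady_clock Clock;

// Wall-clock milliseconds elapsed since t0 (host timer, an assumption).
static double msSince(Clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main()
{
    const int n = 1 << 20;
    float *d = NULL;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));
    // Finish context setup so it is not counted in the timings below.
    cudaThreadSynchronize();  // deprecated; newer toolkits use cudaDeviceSynchronize()

    // Statement 1: asynchronous kernel launch.
    Clock::time_point t0 = Clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d, n, 1000);
    double tLaunch = msSince(t0);

    // Statement 2: CPU work that should overlap with the kernel.
    t0 = Clock::now();
    useCpu();
    double tCpu = msSince(t0);

    // Statement 3: wait for the kernel to finish.
    t0 = Clock::now();
    cudaThreadSynchronize();
    double tSync = msSince(t0);

    std::printf("launch %.2f ms, useCpu %.2f ms, sync %.2f ms\n",
                tLaunch, tCpu, tSync);
    cudaFree(d);
    return 0;
}
```

If the kernel really overlapped with `useCpu()`, the third number should shrink as the second one grows.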
I time each of these three statements.
When useCpu() is empty and does nothing, I get 0.16 ms, 0.00 ms, and 3.81 ms, respectively. Unfortunately, when useCpu() does something useful, I get 0.16 ms, 5.37 ms, and 3.81 ms… Since useCpu() takes longer than the kernel execution, I would expect cudaThreadSynchronize() to return immediately, but it doesn’t: it still takes the full 3.81 ms.
I’m puzzled: the kernel launch returns before the kernel finishes, as advertised, but the kernel itself seems to stall right away and not make any progress until cudaThreadSynchronize() is called. What am I missing? Do I need to do something to really enable asynchronous execution?
This is on Win7 using the 3.0 toolkit, if it matters.
Any suggestions would be greatly appreciated!