I’m looking for some advice.
Here’s the situation:
- I create a CUDA stream
- I do some cudaMemset and cudaMemcpyAsync calls
- I launch a kernel
- Within a loop, I launch two other kernels a couple of times
All kernels and memcpys are enqueued in the CUDA stream.
The problem is that I expect all CUDA calls to be asynchronous, because I need the CPU to do other work while the GPU executes the above kernels one after another.
But for some reason, the first time I launch the second kernel in the loop (step 4), the launch synchronizes: cudaStreamQuery returns “cudaSuccess” right afterwards.
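A minimal sketch of the pattern, not the actual code: the kernel names (initKernel, stepA, stepB), the buffer size, the launch configuration, and the iteration count are all placeholders. It queries the stream after every launch in the loop, which is how the unexpected synchronization shows up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels; the real kernels are not part of this post.
__global__ void initKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = 1.0f;
}
__global__ void stepA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void stepB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Report whether the stream still has pending work.
static const char *streamState(cudaStream_t s) {
    cudaError_t e = cudaStreamQuery(s);
    if (e == cudaSuccess) return "idle (all enqueued work finished)";
    if (e == cudaErrorNotReady) return "busy";
    return cudaGetErrorString(e);
}

int main() {
    const int n = 1 << 20;                 // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *h = NULL, *d = NULL;
    cudaMallocHost((void **)&h, bytes);    // page-locked host buffer for the async copy
    cudaMalloc((void **)&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    cudaMemset(d, 0, bytes);               // cudaMemset takes no stream argument
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);

    initKernel<<<blocks, threads, 0, stream>>>(d, n);

    for (int it = 0; it < 4; ++it) {       // assumed iteration count
        stepA<<<blocks, threads, 0, stream>>>(d, n);
        printf("iter %d after stepA: %s\n", it, streamState(stream));
        stepB<<<blocks, threads, 0, stream>>>(d, n);
        printf("iter %d after stepB: %s\n", it, streamState(stream));
    }

    // The CPU would do its other work here while the GPU drains the stream.

    cudaStreamSynchronize(stream);
    cudaFree(d);
    cudaFreeHost(h);
    cudaStreamDestroy(stream);
    return 0;
}
```

With fully asynchronous launches I would expect "busy" after every launch while earlier work is still running. In my case the query after the first stepB launch reports that the stream is already idle.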
There are no other threads that make CUDA calls. Btw, I’m running the code on an 8800 GTX.
What might be the reason that the kernel launch synchronizes this one time, whereas it works asynchronously (as it should) in all the other loop iterations?