Streamed kernel syncs when it shouldn't... or should it?

Hey guys,
I’m looking for some advice.

Here’s the situation:

  1. I create a CUDA stream
  2. I do some cudaMemset and cudaMemcpyAsync calls
  3. I launch a kernel
  4. Within a loop, I launch two other kernels a couple of times

All kernels and memcpys are enqueued in the CUDA stream.
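
For reference, the sequence above can be sketched roughly as follows. This is only a minimal sketch, not my actual code: the kernel names (setupKernel, stepA, stepB), the buffer size, and the loop count are made-up placeholders. It also uses cudaMemsetAsync so that the memset really goes into the stream, which assumes a toolkit that provides that call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the real ones are not shown in this post.
__global__ void setupKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}
__global__ void stepA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}
__global__ void stepB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;
}

int main() {
    const int n = 1 << 20;              // assumed problem size
    const size_t bytes = n * sizeof(float);
    const int iterations = 4;           // assumed loop count

    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void **)&h, bytes); // pinned host memory, needed for async copies
    cudaMalloc((void **)&d, bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);                                     // step 1

    cudaMemsetAsync(d, 0, bytes, stream);                          // step 2
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    setupKernel<<<blocks, threads, 0, stream>>>(d, n);             // step 3

    for (int it = 0; it < iterations; ++it) {                      // step 4
        stepA<<<blocks, threads, 0, stream>>>(d, n);
        stepB<<<blocks, threads, 0, stream>>>(d, n);
    }

    // The CPU would do its own work here while the GPU drains the stream.

    cudaError_t err = cudaStreamSynchronize(stream);
    printf("stream finished: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```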

The problem is that I expect all CUDA calls to be asynchronous, and I need the CPU to do other work while the GPU is busy executing the above kernels one after another.

But for some reason, the first time I launch the second kernel in the loop (step 4), the call synchronizes, so that cudaStreamQuery returns “cudaSuccess” afterwards.
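
The check I mean looks roughly like the sketch below. The runtime call is cudaStreamQuery, which returns cudaErrorNotReady while work is still pending in the stream and cudaSuccess once everything has finished. The kernel (busyKernel), its repetition count, and the counter standing in for CPU work are assumptions for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just keeps the GPU busy for a while.
__global__ void busyKernel(float *d, int n, int reps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = d[i];
        for (int r = 0; r < reps; ++r) v = v * 1.0001f + 0.5f;
        d[i] = v;
    }
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *d = nullptr;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 10000);

    // If the launch returned without waiting, this query is expected to
    // report cudaErrorNotReady; cudaSuccess here means the work is already done.
    cudaError_t status = cudaStreamQuery(stream);
    printf("right after launch: %s\n", cudaGetErrorString(status));

    // Poll the stream while the CPU does something else.
    unsigned long long cpuWork = 0;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        ++cpuWork;  // stand-in for useful CPU work
    }

    if (status == cudaSuccess) {
        printf("stream drained after %llu units of CPU work\n", cpuWork);
    } else {
        printf("stream reported an error: %s\n", cudaGetErrorString(status));
    }

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```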

There are no other threads that do CUDA work. Btw, I’m running the code on an 8800 GTX.
What might be the reason that the kernel synchronizes this one time, whereas it runs asynchronously (as it should) in all the other loop iterations?

I read somewhere that there’s a limit to how many items can be queued up asynchronously. I think that limit was 16, or maybe it was 8. Do you think that by the time you’ve hit that kernel invocation in the loop that it could have filled such a queue and be blocking?
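
One way to check this on your own machine is to time each launch call on the host: if a queue fills up, one of the launches should suddenly take about as long as a whole kernel instead of a few microseconds. The following is only a rough sketch of that idea. The kernel (slowKernel), its repetition count, and the number of launches are assumptions, and the depth you observe (if any) will depend on the driver and the hardware.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that runs long enough to let launches pile up.
__global__ void slowKernel(float *d, int reps) {
    float v = d[threadIdx.x];
    for (int r = 0; r < reps; ++r) v = v * 1.0001f + 0.5f;
    d[threadIdx.x] = v;
}

int main() {
    float *d = nullptr;
    cudaMalloc((void **)&d, 256 * sizeof(float));
    cudaMemset(d, 0, 256 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm-up launch so one-time initialization cost is not measured below.
    slowKernel<<<1, 256, 0, stream>>>(d, 1);
    cudaStreamSynchronize(stream);

    // Assumed value; raise it if no jump in launch time shows up,
    // since newer drivers may accept many more pending launches.
    const int launches = 64;

    for (int i = 0; i < launches; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        slowKernel<<<1, 256, 0, stream>>>(d, 2000000);
        auto t1 = std::chrono::steady_clock::now();

        // Host-side time spent inside the launch call itself.
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("launch %2d took %10.1f us on the host\n", i, us);
    }

    cudaError_t err = cudaStreamSynchronize(stream);
    printf("done: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The index of the first launch whose host time jumps is a hint at how many operations were accepted before the call started blocking.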


That’s a good possibility, and the code does behave differently when I comment out some of the memcpys or the first kernel launch. Can you remember where exactly you read that information? It seems to me that the Programming Guide 2.0b does not mention a limit.

MisterAnderson42 benchmarked it directly and reported it in the forums:…ndpost&p=414419