Preventing concurrent kernel execution? Can this be done without cudaThreadSynchronize?

I have a quick question. Is there a way on Fermi to prevent concurrent kernel execution without the overhead of a cudaStreamSynchronize()? That is, is there an option that guarantees two kernels are never run concurrently, even though this is unlikely to happen anyway? The “threat” of concurrent kernel execution creates a potential race condition in my application, which uses CUDA streams. This was not the case on the GT200, so there I don’t need to invoke the synchronization. Although the race is unlikely to occur, I want to guarantee I get the correct answer. Using the synchronization costs me as much as 10% of sustained performance, which is hugely significant in this case.

cudaStreamWaitEvent is the best way.
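
For anyone reading later, here is a minimal sketch of what that could look like. The kernels `produce` and `consume`, the buffer size, and the function name `run_demo` are hypothetical and only for illustration; the original application's code is not shown in this thread. An event is recorded in the first stream after the first kernel, and the second stream waits on that event on the device, so the host never blocks between the two launches.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels, for illustration only.
__global__ void produce(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1.0f;
}

__global__ void consume(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Returns 0 on success, 1 on a CUDA error or an unexpected result.
int run_demo() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_buf = nullptr;
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) return 1;

    cudaStream_t s0, s1;
    cudaEvent_t produced;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    // Timing is not needed here, so disable it to keep the event cheap.
    cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

    produce<<<blocks, threads, 0, s0>>>(d_buf, n);
    cudaEventRecord(produced, s0);          // marks the end of produce in s0
    cudaStreamWaitEvent(s1, produced, 0);   // s1 waits on the device; the host does not block
    consume<<<blocks, threads, 0, s1>>>(d_buf, n);

    // The synchronization below exists only so this sketch can check its result.
    float first = 0.0f;
    cudaMemcpyAsync(&first, d_buf, sizeof(float), cudaMemcpyDeviceToHost, s1);
    cudaError_t err = cudaStreamSynchronize(s1);

    cudaEventDestroy(produced);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_buf);
    return (err == cudaSuccess && first == 2.0f) ? 0 : 1;
}
```

Without the cudaEventRecord / cudaStreamWaitEvent pair, nothing would order `consume` after `produce`, since they are launched into different streams.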

Why not issue all kernel launches on the same stream? You could even define a macro to switch back and forth.

Thank you, Tim, for the suggestion. This is exactly what I needed, and with cudaStreamWaitEvent performance is now identical to the naked kernel performance.

Regarding the suggestion to use the same stream for all kernel execution: yes, this would solve the problem of concurrent execution, but it creates a new race condition between the host->device transfers and the kernel execution. Whether I use the same stream for all kernel execution or associate a stream with a given memory buffer, the solution in both cases is to use cudaStreamWaitEvent, as it is much more lightweight than cudaDeviceSynchronize().
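
To make the transfer/kernel ordering concrete, here is a rough sketch under my own assumptions: one stream for host->device copies, one stream for all kernels, and a single device buffer that is refilled each step. The kernel `accumulate`, the function `run_pipeline`, and all sizes are hypothetical stand-ins, not my actual application code. Two events are used: one tells the compute stream that the copy has landed, the other tells the copy stream that the kernel is done with the buffer and it may be overwritten.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: adds the freshly copied buffer into an accumulator.
__global__ void accumulate(float *acc, const float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acc[i] += buf[i];
}

// Returns 0 on success, 1 on a CUDA error or an unexpected result.
int run_pipeline() {
    const int n = 1 << 18, steps = 3, threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *h_in = nullptr, *d_buf = nullptr, *d_acc = nullptr;
    // Pinned host memory so the host->device copies can be truly asynchronous.
    if (cudaMallocHost(&h_in, steps * n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_acc, n * sizeof(float)) != cudaSuccess) return 1;
    for (int k = 0; k < steps; ++k)
        for (int i = 0; i < n; ++i) h_in[k * n + i] = float(k + 1);

    cudaStream_t copyStream, computeStream;
    cudaEvent_t copied, consumed;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);
    cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&consumed, cudaEventDisableTiming);

    cudaMemsetAsync(d_acc, 0, n * sizeof(float), computeStream);
    for (int k = 0; k < steps; ++k) {
        // Do not overwrite d_buf until the previous kernel has read it.
        // On the first pass the event has not been recorded, so this is a no-op.
        cudaStreamWaitEvent(copyStream, consumed, 0);
        cudaMemcpyAsync(d_buf, h_in + k * n, n * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        cudaEventRecord(copied, copyStream);

        // Do not start the kernel until this step's copy has finished.
        cudaStreamWaitEvent(computeStream, copied, 0);
        accumulate<<<blocks, threads, 0, computeStream>>>(d_acc, d_buf, n);
        cudaEventRecord(consumed, computeStream);
    }

    // Final synchronization only so the sketch can verify its result: 1 + 2 + 3.
    float first = 0.0f;
    cudaMemcpyAsync(&first, d_acc, sizeof(float), cudaMemcpyDeviceToHost, computeStream);
    cudaError_t err = cudaStreamSynchronize(computeStream);

    cudaEventDestroy(copied);
    cudaEventDestroy(consumed);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFree(d_buf);
    cudaFree(d_acc);
    cudaFreeHost(h_in);
    return (err == cudaSuccess && first == 6.0f) ? 0 : 1;
}
```

With a single device buffer the copy and the kernel cannot overlap, so this only shows the ordering; in practice one would probably associate a stream and a pair of events with each of several buffers so that the next transfer can proceed while the current kernel runs.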