Concurrent CUDA kernel scheduling on Fermi GPUs


I have a GTX 480 on which I want to run some
CUDA kernels concurrently.

Are there any specific compile options I need, in
addition to making sure the kernels I wish to
run are launched in separate streams?

I have tried running the same kernel code on
different data regions in different streams
so that the launches would be scheduled concurrently. But when I analyze
the streams in Parallel Nsight, they appear to be serialized
rather than running concurrently.
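For reference, here is a minimal sketch of the launch pattern described above: the same kernel run on disjoint data regions, one launch per non-default stream. The kernel `scale` and all sizes are hypothetical, not from the original post; on compute capability 2.x such launches are *eligible* to overlap, but overlap is not guaranteed.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of its chunk in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int nStreams = 2;
    const int chunk = 1 << 20;            // elements per stream
    float *d_data;
    cudaMalloc(&d_data, nStreams * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    // One launch per non-default stream, each on its own data region.
    for (int s = 0; s < nStreams; ++s)
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            d_data + s * chunk, chunk, 2.0f);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}
```

No special compile flags are needed for concurrent kernels themselves, as far as I know, though the code must be built for the Fermi architecture (e.g. `-arch=sm_20`).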

Any help gratefully received.

I am facing the same problem. I am trying it on a Fermi C2050, and the concurrentKernels example from the SDK shows the kernels being launched sequentially. Attached is a snapshot from computeprof. One quick question: which OS and driver/CUDA version are you using, and does the NVIDIA display driver load correctly? I am not sure whether the issue is related to that. I would also like to know whether any compile flags need to be added.

computeprof (and I am pretty sure Parallel Nsight as well) serializes kernel runs while profiling.

To prove that kernels run concurrently, you cannot rely on those tools; you need to do your own timing instead.

I just ran a small experiment, and for 2 concurrent kernels I'm seeing only a 20% reduction in total time.

Is there any other way to check whether the kernels run concurrently? Perhaps you could also post your timing code for comparison.
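One way to do the timing yourself, as suggested above, is to compare the wall-clock time of N back-to-back launches in the default stream against the same N launches spread across N streams, measured with CUDA events. This is only a sketch under my own assumptions: the dummy kernel `busy`, the block size, and the loop count are all made up, and the kernel deliberately uses a single small block so the GPU has idle multiprocessors available for overlap.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical dummy workload, long enough that overlap is measurable.
__global__ void busy(float *out)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < (1 << 16); ++i)
        x = x * 0.999f + 0.001f;
    out[threadIdx.x] = x;
}

// Time n launches; if `streams` is NULL, all go to the default stream.
static float timeLaunches(int n, cudaStream_t *streams, float *d_out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int s = 0; s < n; ++s)
        busy<<<1, 64, 0, streams ? streams[s] : 0>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const int n = 4;
    float *d_out;
    cudaMalloc(&d_out, 64 * sizeof(float));

    cudaStream_t streams[n];
    for (int s = 0; s < n; ++s)
        cudaStreamCreate(&streams[s]);

    float serial     = timeLaunches(n, NULL, d_out);    // default stream
    float concurrent = timeLaunches(n, streams, d_out); // one per stream

    printf("default stream: %.2f ms, %d streams: %.2f ms\n",
           serial, n, concurrent);

    for (int s = 0; s < n; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(d_out);
    return 0;
}
```

If the streamed total is close to the default-stream total divided by n, the kernels overlapped well; if the two totals are nearly equal, the launches were serialized. Run this outside the profiler, since the profiler itself serializes kernels.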