I have a GTX 480 on which I want to run some
CUDA kernels concurrently.
Are there any specific compile options I need, in
addition to making sure the kernels I wish to
run are launched in separate stream contexts?
I have tried running the same kernel code operating
on different data regions in different stream contexts,
so that they would be scheduled concurrently. But when I analyze
the streams in Parallel Nsight, they appear to be serialized
instead of running concurrently.
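For reference, my launch pattern looks roughly like this (a simplified sketch; `myKernel` and the sizes are placeholders, not my actual code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: doubles each element of its data region.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 16;  // elements per stream (placeholder size)
    float *d_data;
    cudaMalloc(&d_data, nStreams * n * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // Each launch goes into its own stream and touches its own
    // data region, so in principle the hardware is free to
    // overlap the kernels.
    for (int i = 0; i < nStreams; ++i)
        myKernel<<<n / 256, 256, 0, streams[i]>>>(d_data + i * n, n);

    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    return 0;
}
```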
I am facing the same problem too. I am trying it on a Fermi C2050, and the concurrentKernels example given in the SDK shows that the kernels are launched sequentially. Attached is a snapshot from computeprof.

One quick question: which OS and driver/CUDA version are you using? Is the NVIDIA display driver loading correctly? Not sure if the issue is related to that. I would also like to know if there are some compile flags to be added.
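In case it helps with debugging, the runtime can be queried to confirm whether the device actually reports concurrent-kernel support (a quick sketch, querying device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // concurrentKernels is 1 if the device can run multiple
    // kernels from different streams at the same time.
    printf("%s (compute %d.%d): concurrentKernels = %d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);
    return 0;
}
```

If this prints 0, the serialization would be expected regardless of how the streams are set up.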