I have the following code:


#pragma omp parallel


     unsigned int cpu_id = omp_get_thread_num();

     cudaSetDevice( cpu_id );






Now my question is: will these two kernels launch concurrently on two separate cards?

Not without explicit coding.

So I have to do it like the multiGPU example in the CUDA SDK? OpenMP alone is not enough?

Oh wait, I did not notice that.
Well, if you have also declared all the variables on the right GPU and copied the required memory to it, then that should probably work as far as I can tell, but I have never played with OpenMP, so don't take my word for it.

It should work. If I saw OpenMP code like this in a program I'd probably kill you (hard-coding all of that? Such a bad idea), but that's neither here nor there…

Yeah, I have everything set up for both cards.
I also forgot to mention that both kernels are running inside a loop.
I am asking because when I measure the execution time of this code and of the same code with one kernel call commented out, the one-kernel version runs in almost exactly half the time of the two-kernel version. It seems to me that the kernel calls are being serialized, so I would like an expert's opinion.

It's just a simple example, not an actual application…

Yes, these will go to two different GPUs.

I think there is a cudaOpenMP sample in the SDK version for Windows.
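
For what it's worth, here is a minimal sketch of the one-OpenMP-thread-per-GPU pattern discussed above. It is not the code from the SDK sample. The kernel name scale, the buffer size, and the iteration count are hypothetical placeholders, and it assumes a build along the lines of nvcc -Xcompiler -fopenmp. Each thread binds to its own device before allocating or launching anything, and synchronizes its device before leaving the parallel region.

```cuda
#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus < 1) { printf("no CUDA device found\n"); return 1; }

    const int n = 1 << 20;            // assumed buffer size
    omp_set_num_threads(num_gpus);    // request one host thread per GPU

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);           // bind this host thread to its own GPU

        float *d_data = NULL;
        cudaMalloc(&d_data, n * sizeof(float));    // allocated on device dev
        cudaMemset(d_data, 0, n * sizeof(float));

        // The kernel runs inside a loop, as in the question.
        for (int iter = 0; iter < 100; ++iter)
            scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);

        cudaDeviceSynchronize();      // wait for this GPU to finish its work
        cudaFree(d_data);
    }
    double t1 = omp_get_wtime();

    // Wall time covers context creation, allocation, and all launches.
    printf("%d GPU(s): %.3f ms\n", num_gpus, (t1 - t0) * 1e3);
    return 0;
}
```

Note that the measured time includes creating a CUDA context on each device, which can dominate for short kernels. If the two-GPU run still takes about twice as long as the one-GPU run, it is worth checking that more than one OpenMP thread is actually running (for example by printing omp_get_num_threads() inside the region) and that OpenMP was enabled at compile time.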