Multi-GPU in CUDA 4

Hi all,

I have seen that CUDA 4 allows a single CPU thread to handle multiple GPUs. However, when I run a simple test program, only one kernel executes at a time: the kernel launch for the second GPU waits until the kernel on the first GPU has finished. This does not happen with one thread per GPU (CUDA 3.x style). I have two GTX 285 GPUs and CUDA release 4.0, V0.2.1221.

It is not clear to me whether the problem comes from the cards (compute capability 1.3), from the code, or from something else.

Any help would be appreciated.

The code looks like this:

int N;
cudaGetDeviceCount(&N);

for (int i = 0; i < N; i++) {
    cudaSetDevice(i);
    /* do some cudaMalloc / cudaMemcpy */
    mykernel<<<blocks, threads>>>();
}

Hi!

I would assume that the blocking part is in the cudaMalloc/cudaMemcpy code you did not post; cudaMemcpy is a synchronous call, while kernel launches are asynchronous.

Rearranging the code so that all blocking setup calls finish before any kernel is launched should remove the serialization:

int N;
cudaGetDeviceCount(&N);

for (int i = 0; i < N; i++) {
    cudaSetDevice(i);
    /* do some cudaMalloc / cudaMemcpy */
}

for (int i = 0; i < N; i++) {
    cudaSetDevice(i);
    mykernel<<<blocks, threads>>>();
}

Regards, Rolf


Thanks for your response, but it is still not working. I am now fairly sure the problem is that a single CPU thread cannot drive two GPUs concurrently on my system. Maybe it works only on Fermi cards; I cannot find anything about this in the documentation.

Regards,

Have you tried using streams?

Good point, I didn't. The problem is that cudaMemcpyAsync requires page-locked memory, but I use a third-party library that allocates pageable memory. I could probably lock the memory myself with mlock(), but I didn't try it since the code works with multithreading.
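For what it's worth, mlock() alone is not enough here: the CUDA runtime only treats memory as page-locked if it was allocated with cudaHostAlloc or registered with cudaHostRegister (available since CUDA 4.0). Below is a minimal sketch of the streams approach under that assumption; mykernel, the buffer size, and the launch configuration are placeholders, not code from the original post:

```cuda
// Sketch: driving multiple GPUs from one host thread using per-device
// streams and cudaHostRegister to pin an existing pageable allocation.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void mykernel(float *d_buf) { /* placeholder kernel */ }

int main(void)
{
    int N;
    cudaGetDeviceCount(&N);

    const size_t BUF_BYTES = 1 << 20;   // placeholder buffer size
    float       *h_buf[8], *d_buf[8];
    cudaStream_t stream[8];

    for (int i = 0; i < N; i++) {
        // Stand-in for the third-party library's pageable allocation.
        h_buf[i] = (float *)malloc(BUF_BYTES);
        // Page-lock the existing buffer so cudaMemcpyAsync can use it.
        cudaHostRegister(h_buf[i], BUF_BYTES, cudaHostRegisterPortable);

        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void **)&d_buf[i], BUF_BYTES);
    }

    // Queue an async copy and a kernel on each device; none of these
    // calls block, so the GPUs can work concurrently.
    for (int i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(d_buf[i], h_buf[i], BUF_BYTES,
                        cudaMemcpyHostToDevice, stream[i]);
        mykernel<<<64, 256, 0, stream[i]>>>(d_buf[i]);
    }

    // Wait for all devices, then clean up.
    for (int i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
        cudaHostUnregister(h_buf[i]);
        free(h_buf[i]);
    }
    return 0;
}
```

The key point is that cudaHostRegister pins memory the library already allocated, so no allocations have to be moved into cudaHostAlloc.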

Thanks,