Multi-GPU computing: how to run N kernels on N devices concurrently?

Hello,

I would like to write an application that runs N kernels on N devices concurrently. How can I do this? I understand one way is to use threads, as in the "simpleMultiGPU" example in the SDK, but that feature (cutStartThread) is not documented at all, and I don't even know how to construct a TGPUplan structure. (Does it always have to take the same form as in the example, or what are the constraints on TGPUplan anyway?) It seems to me that streams may be useful, but I don't know how to handle more than one device. Can someone help me? I would like to write something like this:

Prepare_some_data_on_host;

for (i = 0; i < N; i++) {  // we have N devices
  Copy_ith_part_of_data_from_host_to_ith_device;
}
Wait_for_data_transfers_to_be_complete;

for (i = 0; i < N; i++) {
  Start_kernel_on_ith_device;
}
Wait_for_kernels_to_be_complete;

for (i = 0; i < N; i++) {
  Copy_data_from_ith_device_to_host;
}
Wait_for_data_transfers_to_be_complete;
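For what it's worth, since CUDA 4.0 a single host thread can drive all devices by switching with cudaSetDevice(), so the loop structure above maps almost directly onto the runtime API. Here is a minimal sketch of that idea; the kernel, the buffer names, and the assumption that the data divides evenly across devices are all placeholders, and the three explicit barriers above collapse into one per-stream synchronize because the copies and launch are enqueued asynchronously in order on each device's stream:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(float *data, int n);  // assumed user kernel

// Sketch: one host thread, N devices, one stream per device.
void run_on_n_devices(float *h_in, float *h_out, int total, int N) {
    int chunk = total / N;       // assume total divides evenly by N
    cudaStream_t stream[8];      // assume N <= 8 for this sketch
    float *d_buf[8];

    for (int i = 0; i < N; i++) {
        cudaSetDevice(i);        // all following calls target device i
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
        // Enqueue H2D copy, kernel, and D2H copy on device i's stream;
        // they run in order on that device but concurrently across devices.
        cudaMemcpyAsync(d_buf[i], h_in + i * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(d_buf[i], chunk);
        cudaMemcpyAsync(h_out + i * chunk, d_buf[i], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    // Wait for every device's stream to drain, then clean up.
    for (int i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}
```

Note that cudaMemcpyAsync only truly overlaps with the host when the host buffers are pinned (allocated with cudaMallocHost); with pageable memory the copies still work but may serialize.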

Take a look at MrAnderson’s GPUWorker class. Lots of people have used it here to easily add multi-GPU support to their programs:

http://forums.nvidia.com/index.php?showtopic=66598

To clarify things a bit: you need a host thread for each device in the system if you want to do multi-GPU, so that each host thread can handle the data transfers and kernel launches for the device it controls.
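A minimal sketch of that thread-per-device pattern with POSIX threads, in the spirit of the SDK's simpleMultiGPU example. The GPUPlan struct here is hypothetical: the SDK's TGPUplan is just an ordinary user-defined struct for passing work to a thread, so you can shape yours however you like. The kernel is assumed:

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical per-device plan (analogous to the SDK's TGPUplan):
// just a plain struct describing one thread's slice of the work.
typedef struct {
    int device;     // which GPU this thread drives
    float *h_data;  // this thread's slice of host data
    int n;          // number of elements in the slice
} GPUPlan;

__global__ void kernel(float *data, int n);  // assumed user kernel

// Each worker binds itself to one device, then does the usual
// copy-in / launch / copy-out sequence for its slice.
static void *worker(void *arg) {
    GPUPlan *plan = (GPUPlan *)arg;
    cudaSetDevice(plan->device);
    float *d_data;
    cudaMalloc(&d_data, plan->n * sizeof(float));
    cudaMemcpy(d_data, plan->h_data, plan->n * sizeof(float),
               cudaMemcpyHostToDevice);
    kernel<<<(plan->n + 255) / 256, 256>>>(d_data, plan->n);
    cudaMemcpy(plan->h_data, d_data, plan->n * sizeof(float),
               cudaMemcpyDeviceToHost);  // implicit sync with the kernel
    cudaFree(d_data);
    return NULL;
}

// Launch one worker per device, then join them all.
void run(GPUPlan *plans, int N) {
    pthread_t tid[8];  // assume N <= 8 for this sketch
    for (int i = 0; i < N; i++)
        pthread_create(&tid[i], NULL, worker, &plans[i]);
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
}
```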

Thank you very much, this looks perfect.

I’ve been using OpenMP for this, though my (admittedly limited) understanding is that POSIX threads (pthread.h) will work too.
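The OpenMP version is pleasantly short, since the parallel region gives you one host thread per device for free. A sketch under the same assumptions as above (hypothetical kernel, even split of the data, each thread's index doubling as its device index):

```cuda
#include <omp.h>
#include <cuda_runtime.h>

__global__ void kernel(float *data, int n);  // assumed user kernel

// One OpenMP thread per device; each thread owns device i for the
// duration of the parallel region.
void run_openmp(float *h_data, int chunk, int N) {
    #pragma omp parallel num_threads(N)
    {
        int i = omp_get_thread_num();
        cudaSetDevice(i);  // bind this thread to device i
        float *d;
        cudaMalloc(&d, chunk * sizeof(float));
        cudaMemcpy(d, h_data + i * chunk, chunk * sizeof(float),
                   cudaMemcpyHostToDevice);
        kernel<<<(chunk + 255) / 256, 256>>>(d, chunk);
        cudaMemcpy(h_data + i * chunk, d, chunk * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);
    }  // implicit barrier: all devices are done here
}
```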