I would like to write an application that runs N kernels on N devices concurrently. How can I do that? I understand one way is to use threads, as in the "simpleMultiGPU" example in the SDK, but that feature (cutStartThread) is not documented at all and I don't even know how to construct a TGPUplan structure. (Does it always have to have the same form as in the example, or what are the constraints on TGPUplan anyway?) It seems to me that streams may be useful here, but I don't know how to handle more than one device with them. Can someone help me? I would like to write something like this:
Prepare_some_data_on_host;
for (i = 0; i < N; i++) {   // we have N devices
    Copy_ith_part_of_data_from_host_to_ith_device;
}
Wait_for_data_transfers_to_be_complete;
for (i = 0; i < N; i++) {
    Start_kernel_on_ith_device;
}
Wait_for_kernels_to_be_complete;
for (i = 0; i < N; i++) {
    Copy_data_from_ith_device_to_host;
}
Wait_for_data_transfers_to_be_complete;
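In actual CUDA calls I imagine something along the lines of the sketch below, with one stream per device and cudaSetDevice to switch between them. All the names here (N, CHUNK, myKernel) are just placeholders of mine, and I am not even sure whether a single host thread is allowed to drive several devices like this, or whether it needs a fairly recent runtime:

#include <cuda_runtime.h>

#define N      2          /* number of devices (placeholder)   */
#define CHUNK  (1 << 20)  /* elements per device (placeholder) */

__global__ void myKernel(float *d, int n)  /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    float        *h_data;
    float        *d_data[N];
    cudaStream_t  stream[N];
    int           i;

    /* Pinned, portable host memory so the async copies can really overlap. */
    cudaHostAlloc((void **)&h_data, N * CHUNK * sizeof(float), cudaHostAllocPortable);
    for (i = 0; i < N * CHUNK; i++) h_data[i] = (float)i;

    /* Copy the i-th part of the data to the i-th device. */
    for (i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void **)&d_data[i], CHUNK * sizeof(float));
        cudaMemcpyAsync(d_data[i], h_data + i * CHUNK, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
    }

    /* Start the kernel on the i-th device. */
    for (i = 0; i < N; i++) {
        cudaSetDevice(i);
        myKernel<<<(CHUNK + 255) / 256, 256, 0, stream[i]>>>(d_data[i], CHUNK);
    }

    /* Copy the results back from the i-th device. */
    for (i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaMemcpyAsync(h_data + i * CHUNK, d_data[i], CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }

    /* Wait for all transfers and kernels to be complete. */
    for (i = 0; i < N; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_data[i]);
    }
    cudaFreeHost(h_data);
    return 0;
}

Is something like this possible from one host thread, or do I really need a separate host thread per device?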
To clarify things a bit, you need a host thread for each device in the system if you want to do multi-GPU, so that each host thread can handle the data transfers and kernel launches for whatever device it’s controlling.
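As far as I know, cutStartThread is just a thin convenience wrapper in the cutil helpers around the platform's native thread-creation call, and TGPUplan is an ordinary struct defined by the simpleMultiGPU sample itself, so there are no constraints on it beyond carrying whatever per-device information your worker threads need. Very roughly, the per-thread pattern looks like the sketch below, using plain POSIX threads; GPUplan, gpuWorker, myKernel and the sizes are placeholder names of mine, not anything from the SDK:

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* elements per device (placeholder) */

__global__ void myKernel(float *d, int n)  /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

typedef struct {
    int    device;   /* which GPU this thread controls       */
    float *h_data;   /* this thread's slice of the host data */
    int    n;        /* number of elements in the slice      */
} GPUplan;

static void *gpuWorker(void *arg)
{
    GPUplan *plan = (GPUplan *)arg;
    float *d_data;

    cudaSetDevice(plan->device);   /* bind this host thread to its GPU */
    cudaMalloc((void **)&d_data, plan->n * sizeof(float));
    cudaMemcpy(d_data, plan->h_data, plan->n * sizeof(float),
               cudaMemcpyHostToDevice);
    myKernel<<<(plan->n + 255) / 256, 256>>>(d_data, plan->n);
    cudaMemcpy(plan->h_data, d_data, plan->n * sizeof(float),
               cudaMemcpyDeviceToHost);   /* implicitly waits for the kernel */
    cudaFree(d_data);
    return NULL;
}

int main(void)
{
    int nDevices, i;
    cudaGetDeviceCount(&nDevices);

    float     *h_data  = (float *)malloc(nDevices * CHUNK * sizeof(float));
    GPUplan   *plan    = (GPUplan *)malloc(nDevices * sizeof(GPUplan));
    pthread_t *threads = (pthread_t *)malloc(nDevices * sizeof(pthread_t));

    for (i = 0; i < nDevices * CHUNK; i++) h_data[i] = (float)i;

    /* One worker thread per device, each given its own slice of the data. */
    for (i = 0; i < nDevices; i++) {
        plan[i].device = i;
        plan[i].h_data = h_data + i * CHUNK;
        plan[i].n      = CHUNK;
        pthread_create(&threads[i], NULL, gpuWorker, &plan[i]);
    }

    /* Wait for all GPUs to finish. */
    for (i = 0; i < nDevices; i++)
        pthread_join(threads[i], NULL);

    free(threads); free(plan); free(h_data);
    return 0;
}

Each worker calls cudaSetDevice once; after that, every allocation, copy and kernel launch it issues goes to its own device, and the main thread simply joins the workers to wait for all GPUs to complete.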