Parallelizing Thrust across multiple GPUs (and with cuFFT)

Hi all,

In my research group, a discussion came up about extending our capabilities to use multiple GPUs. The core of our code uses cuFFT and a lot of element-wise operations (though not exclusively!).

I am curious whether it is possible to use thrust across multiple GPUs. Say we have a very long array holding the positions of all the atoms.

  1. Is there a way, via thrust, to break it up across the N GPUs?
  2. A thrust::device_vector<thrust::complex> can be recast to work with cuFFT because of the identical memory layout (see the sketch after this list). Is there a way to use cuFFTMp with thrust vectors across GPUs? Does this start heading into MPI territory?
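
For context, the single-GPU pattern I have in mind looks roughly like the sketch below (a minimal, hypothetical example assuming single-precision data and an in-place C2C transform; error checking omitted):

```cpp
// Sketch: hand a thrust::device_vector<thrust::complex<float>> to cuFFT by
// reinterpreting its raw device pointer as cufftComplex*, which has the same
// memory layout. Error checking omitted for brevity.
#include <thrust/device_vector.h>
#include <thrust/complex.h>
#include <cufft.h>

int main()
{
    const int n = 1 << 16;
    thrust::device_vector<thrust::complex<float>> data(n);

    cufftComplex *raw =
        reinterpret_cast<cufftComplex *>(thrust::raw_pointer_cast(data.data()));

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);          // 1D single-precision C2C plan
    cufftExecC2C(plan, raw, raw, CUFFT_FORWARD);  // in-place forward transform
    cufftDestroy(plan);
    return 0;
}
```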

Without modification to thrust itself, a thrust vector cannot span multiple GPUs. The way to do this with thrust is to create one vector per GPU, using methods that are basically identical to ordinary CUDA multi-GPU methods. Just as with ordinary CUDA, where you would launch 4 kernels to use 4 GPUs, with thrust you would need to launch 4 thrust::transform calls to use 4 GPUs. And this by no means addresses every conceivable question on the topic.
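
As a rough illustration (a minimal sketch, not a complete or tuned implementation), the one-vector-per-GPU pattern could look something like the following; the element count and the negate functor are just placeholders for whatever your element-wise work actually is:

```cpp
// Sketch: one thrust::device_vector per GPU, one thrust::transform per GPU.
// Error checking omitted for brevity.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>
#include <vector>

int main()
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    const size_t chunk = 1 << 20;   // elements per GPU (placeholder size)
    std::vector<thrust::device_vector<float>> x(n_gpus);
    std::vector<thrust::device_vector<float>> y(n_gpus);

    // One vector per GPU: the allocation lands on whichever device is current.
    for (int g = 0; g < n_gpus; ++g) {
        cudaSetDevice(g);
        x[g].resize(chunk);
        y[g].resize(chunk);
        thrust::fill(x[g].begin(), x[g].end(), 1.0f);
    }

    // One thrust::transform per GPU, analogous to launching one kernel per GPU.
    for (int g = 0; g < n_gpus; ++g) {
        cudaSetDevice(g);
        thrust::transform(x[g].begin(), x[g].end(), y[g].begin(),
                          thrust::negate<float>());
    }

    // Wait for all devices before using the results.
    for (int g = 0; g < n_gpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
    return 0;
}
```

Be aware that many thrust algorithms synchronize before returning, so to get actual overlap across GPUs you may want one host thread per GPU, or the thrust::async algorithm variants.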

Regarding your second question: for questions specifically about cuFFT or cuFFTMp, please post on the cuFFT forum.

Thanks Robert. Much appreciated.

Do you know any good resources for multi-GPU programming (either with or without thrust)?

There are some multi-GPU programming concepts scattered through the latter sections of this online training course, particularly sections 7, 10, and 11.

Multi-GPU programming is covered in a number of GTC presentations; for example, here is one and here is another. (Google is your friend; add the "gtc" keyword to your search.)

NVIDIA offers a multi-GPU CUDA DLI course (currently only available to groups in an instructor-led format, not available "on-demand" AFAIK). NVIDIA is also offering a multi-node course to the public at GTC Fall 2022 (in September); see here. The multi-node course assumes multi-GPU concepts (the previous DLI course I mentioned is pretty much expected as a prerequisite) and focuses on MPI and NVSHMEM to coordinate the multi-GPU/multi-node work. The strongest emphasis is on NVSHMEM.

(Later:) There is an on-demand DLI course that covers multi-GPU programming; see here.

Here is another excellent resource, a collection of examples.

Very much appreciated.

Can you confirm the link for “this online training course” is correct? I can’t click on the link from my end.

Fixed link, sorry.

No worries. Thank you and have a good rest of your day!
