CUDA-aware MPI

Dear all,

I need to develop a GPU-MPI program to handle a huge memory problem that can’t fit on one GPU. I have the following questions:

  1. Can the GPU-MPI combination only be done using Kepler GPUs?

  2. How can I share the data among the GPUs? Should I deal with each GPU as I would a normal CPU?

  3. Will I need more than one CPU core?

  4. Any resources?

Many thanks
Buraq

In passing, I would mention these sections of the CUDA C Programming Guide:

3.2.6.4. Peer-to-Peer Memory Access
3.2.6.5. Peer-to-Peer Memory Copy
3.2.7. Unified Virtual Address Space
3.2.8. Interprocess Communication
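
For illustration only, a minimal (untested) sketch of what the peer-to-peer sections describe - checking, enabling and using P2P between two devices on one node; the device numbers and buffer size are just placeholders:

// sketch: peer-to-peer between device 0 and device 1 (error checking omitted)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 access device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    size_t bytes = 256 << 20;                // 256 MB per device, illustrative only
    float *d0 = nullptr, *d1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // kernels on dev 0 may now dereference dev 1 memory

    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // explicit peer copy; the driver stages it through the host if direct P2P is unavailable
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}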

I would also mention that some functionality, like CUDA-aware MPI and peer-to-peer, is supposedly not available on - not enabled for - all devices, even when they are of adequate compute capability.
I am almost tempted to tell you that you need a Tesla for the mentioned functionality; in other words, just check whether your particular GPU supports it.
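
As a first step you could simply query what your cards report; a rough sketch (which fields actually matter depends on what you end up using):

// sketch: list each device's compute capability, UVA support and memory
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("device %d: %s, cc %d.%d, unified addressing: %d, %.1f GB\n",
               d, p.name, p.major, p.minor, p.unifiedAddressing,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}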

Nevertheless, perhaps a workaround is possible.
But can you be more specific: do you have a huge problem whose memory span cannot fit on a single GPU, or a huge problem that, as a whole, cannot fit on a single GPU? (To me, the two are not the same.)

Do these really work? I also have a problem that would use 20-50 GB of RAM, and this would make the programming a little easier.

Thanks, Jimmy, for your response. I have a huge amount of data to deal with; basically more than 7 GB, and the HPC system I’m currently using supports up to 6 GB per GPU.

You mentioned MPI; to my knowledge, though, not all devices support it.
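
For completeness, this is roughly the shape a CUDA-aware MPI program takes when the MPI library itself is built with CUDA support (e.g. Open MPI or MVAPICH2); only a sketch, and the one-GPU-per-rank mapping is an assumption on my part:

// sketch: with a CUDA-aware MPI build, device pointers go straight into MPI calls
// (build against both the MPI and CUDA toolkits, e.g. mpicxx with the CUDA include/lib paths)
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // usual layout: rank i drives device i (assumes ranks per node == GPUs per node)
    cudaSetDevice(rank);

    const int N = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0 && size > 1) {
        // d_buf is a device pointer, passed directly to MPI_Send
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}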

The alternative to MPI would of course be shared memory; the concept of shared memory is hardly unique to GPUs and their SMs, nor confined to them, I would think.
I have had multiple embarrassingly parallel instances of the same algorithm running on the same device, talking to each other via ‘global shared memory’ - global memory that is shared by the instances.
And I think I would have little difficulty extending this to a single host with multiple devices, and then to multiple hosts, each with multiple devices.
Using shared memory instead of MPI would allow me to stick with devices like a GTX 780 Ti, which have exceptional cost/FLOP economy, in my view.
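
Purely as a sketch of that non-MPI, single-host route - one CPU thread per GPU, with a plain host array acting as the shared staging area between devices; the kernel, the sizes and the OpenMP layout are my own placeholders, not the code referred to above:

// sketch: one OpenMP thread per device, host array as shared staging area
// build with e.g.: nvcc -Xcompiler -fopenmp multi_gpu.cu
#include <vector>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                  // stand-in for the real per-instance work
}

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    const int N = 1 << 22;                 // total elements (assumed divisible by nDev)
    std::vector<float> host(N, 1.0f);      // the "shared" data lives on the host
    const int chunk = N / nDev;

    #pragma omp parallel num_threads(nDev)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        float *d = nullptr;
        cudaMalloc(&d, chunk * sizeof(float));
        cudaMemcpy(d, host.data() + dev * chunk, chunk * sizeof(float),
                   cudaMemcpyHostToDevice);

        scale<<<(chunk + 255) / 256, 256>>>(d, chunk, 2.0f);

        cudaMemcpy(host.data() + dev * chunk, d, chunk * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);
    }
    // host[] now holds the consolidated result from all devices
    return 0;
}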

Are you merely considering multiple GPUs for the sake of fitting/loading all the data on devices, or are you equally considering multiple GPUs to achieve performance objectives?

Surely you would not need all the data at once, implying that it can be broken up into blocks and processed on a block basis, perhaps subsequently consolidating the block results…?
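
Again only a sketch of that block-wise idea, with a fixed-size device buffer reused for one block at a time; the kernel and sizes are placeholders, and in the real case the data might come from disk rather than one big host array:

// sketch: stream a data set larger than device memory through a small device buffer
#include <vector>
#include <cuda_runtime.h>

__global__ void process(float *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];          // stand-in for the real per-block work
}

int main() {
    const size_t total = (size_t)2 << 30;   // ~8 GB of floats, mirroring the >7 GB case
    const size_t block = 8 << 20;           // ~32 MB per block, reused for every pass

    std::vector<float> host(total, 1.0f);   // assumes enough host RAM; could be memory-mapped instead

    float *d = nullptr;
    cudaMalloc(&d, block * sizeof(float));

    for (size_t off = 0; off < total; off += block) {
        size_t n = (total - off < block) ? (total - off) : block;
        cudaMemcpy(d, host.data() + off, n * sizeof(float), cudaMemcpyHostToDevice);
        process<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
        cudaMemcpy(host.data() + off, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        // block results could be consolidated here (e.g. a running reduction)
    }

    cudaFree(d);
    return 0;
}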