You mentioned MPI; to my knowledge, though, not all devices support it.
The alternative to MPI would of course be shared memory; the concept of shared memory is hardly unique to GPUs and their SMs, nor confined to them, I would think.
I have had multiple embarrassingly parallel instances of the same algorithm running on the same device, communicating via ‘global shared memory’, i.e. global memory that is shared by all the instances.
And I think I would have little difficulty extending this first to a single host with multiple devices, and then to multiple hosts, each with multiple devices.
Using shared memory instead of MPI would allow me to stick with devices like the GTX 780 Ti, which offer exceptional cost-per-FLOP economy, in my view.
Are you considering multiple GPUs merely to fit/load all the data onto the devices, or also to meet performance objectives?
Surely you would not need all the data at once, implying that it can be broken up into blocks and processed block by block, perhaps with the block results consolidated afterwards…?