I tried CUDA with MPI on a single node: using MPI, eight CPU processes launch the same task (a SIMD task) on the same GTX 480. The combined CUDA+MPI solution is faster than using a single CPU process with the GPU. So, instead of using MPI, I tried another solution based on CUDA streams, but it does not work: I think I already completely saturate the shared memory when using only one stream, so I assume that using several streams requires enough memory for all the streams at once.
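For concreteness, the stream-based variant has roughly the following shape. This is only a minimal sketch, not the actual code: the kernel `simd_task`, the stream count, the problem size, and the amount of dynamic shared memory per block are all placeholder assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical stand-in for the real task: each thread stages one element
// in dynamic shared memory and writes back twice its value.
__global__ void simd_task(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];
        out[i] = 2.0f * tile[threadIdx.x];
    }
}

int main()
{
    const int kStreams = 8;        // assumption: one stream per former MPI process
    const int n = 1 << 20;         // assumption: elements handled by each stream
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = n * sizeof(float);
    const size_t shmem = threads * sizeof(float);  // dynamic shared memory per block

    cudaStream_t streams[kStreams];
    float *h_in[kStreams], *h_out[kStreams], *d_in[kStreams], *d_out[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        // Pinned host buffers, so the async copies can overlap with other work.
        CHECK(cudaMallocHost((void **)&h_in[s], bytes));
        CHECK(cudaMallocHost((void **)&h_out[s], bytes));
        CHECK(cudaMalloc((void **)&d_in[s], bytes));
        CHECK(cudaMalloc((void **)&d_out[s], bytes));
        for (int i = 0; i < n; ++i) h_in[s][i] = 1.0f;
    }

    // Queue copy-in, kernel, and copy-out on each stream without waiting.
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaMemcpyAsync(d_in[s], h_in[s], bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        simd_task<<<blocks, threads, shmem, streams[s]>>>(d_in[s], d_out[s], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_out[s], d_out[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    // Spot-check one element per stream (inputs were 1.0f, so expect 2.0f).
    bool ok = true;
    for (int s = 0; s < kStreams; ++s)
        if (h_out[s][0] != 2.0f) ok = false;
    printf("%s\n", ok ? "all streams produced the expected value" : "mismatch");

    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaFree(d_in[s]));
        CHECK(cudaFree(d_out[s]));
        CHECK(cudaFreeHost(h_in[s]));
        CHECK(cudaFreeHost(h_out[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return 0;
}
```

In this pattern every stream needs its own device buffers, and kernels from different streams can only run at the same time if the device still has free resources (shared memory, registers, block slots) while another kernel is resident.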
My questions are the following: Why do I get a speedup with CUDA+MPI (greater than 3 when using eight processes instead of one) when I cannot get a similar speedup with a single CPU process that uses several streams? And how do the eight CPU processes access the GPU?
I would be very grateful for your responses.