CUDA streams vs CUDA+MPI: how do different CPU processes access the GPU?

Hi,

I tried CUDA with MPI on one node: using MPI, eight CPU processes launch the same task (a SIMD task) on the same GTX 480. The whole CUDA+MPI solution is faster than using only one CPU process with the GPU. So, instead of MPI, I tried another solution based on CUDA streams, but this solution does not work, because (I think) I completely saturate the shared memory even with only one stream; using several streams presumably requires enough memory for all of them.
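For reference, this is the general pattern I mean by “CUDA streams” (an illustrative sketch only, not my actual code; myKernel, NSTREAMS and CHUNK are made-up placeholders):

```
// Illustrative sketch: issue the same kernel on several CUDA streams so that
// transfers and kernels belonging to different chunks can overlap.
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder work
}

int main()
{
    const int NSTREAMS = 8;
    const int CHUNK    = 1 << 20;   // elements per stream

    cudaStream_t streams[NSTREAMS];
    float *d_data, *h_data;

    cudaMalloc(&d_data, NSTREAMS * CHUNK * sizeof(float));
    cudaMallocHost(&h_data, NSTREAMS * CHUNK * sizeof(float)); // pinned memory, required for async copies

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; ++s) {
        float *h = h_data + s * CHUNK;
        float *d = d_data + s * CHUNK;
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        myKernel<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```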

My questions are the following: why do I get a speedup (greater than 3x when using 8 processes instead of one) with CUDA+MPI, when it is not possible to get a similar speedup with one CPU process using several streams? And how do the eight CPU processes access the GPU?
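For context, the CUDA+MPI variant follows the usual pattern where each rank is a separate process with its own CUDA context on the same device (again an illustrative sketch with placeholder names, not the real code):

```
// Illustrative sketch: every MPI rank is a separate CPU process that creates
// its own CUDA context on device 0 and launches the same kernel on its own
// slice of the data. myKernel and N are placeholders.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder work
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;    // elements per rank
    cudaSetDevice(0);         // every rank uses the single GTX 480

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    // ... fill d_data with this rank's share of the problem ...

    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();

    // ... copy results back and combine them with MPI_Gather / MPI_Reduce ...
    printf("rank %d finished its slice\n", rank);

    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}
```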

I will be very grateful for your responses.

So the question really boils down to: why is running 8 separate contexts on one device simultaneously faster than running a single context with streams?

The sane answer is probably that it should never be the case (given that your streams implementation is not completely wrong). The thing that immediately springs to mind is that you might be doing this on an insane platform, like Windows Vista or 7. Otherwise it would be pretty hard to explain.

I use Linux, and I get a huge speedup with MPI (400%). I also managed to get a small speedup with streams (11%), but I still don’t understand how I can get such a large speedup with MPI. Do you know how different CPU processes access the GPU?

Are you sure that the GPU is the bottleneck and not the CPU?

Multiple processes access the GPU serially, with context-switching overhead, in a completely non-deterministic fashion, which is why your example is so counterintuitive. It suggests that something other than the GPU is the performance limit of your code, like single-threaded CPU performance or I/O.

The majority of the computations (95%) are done on the GPU. Besides, is there any chance that the GPU executes a kernel in less time than it takes the CPU to launch it?

Thank you again.

Very unlikely. On Linux, a kernel launch only takes about 15 to 20 microseconds. Context establishment takes on the order of 100 milliseconds. So the sum of those is about the minimum overhead per process.
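If you want to check the launch overhead on your own setup, a quick and rough way is to time a large number of empty kernel launches; something along these lines (illustrative sketch, the exact numbers depend on driver and hardware):

```
// Rough measurement of kernel launch overhead: time many launches of an
// empty kernel and divide by the count. This includes the execution of a
// trivial kernel, so treat the result as an upper bound on launch cost.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int LAUNCHES = 10000;

    emptyKernel<<<1, 1>>>();        // warm-up: pays the context creation cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < LAUNCHES; ++i)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average per launch: %.2f us\n", us / LAUNCHES);
    return 0;
}
```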

If I understand you correctly, and 95% of your computations are done on the GPU and the GPU is more than 20 times faster than the CPU, then the CPU indeed is the bottleneck. Check Amdahl’s law.
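As a back-of-the-envelope check (taking the 95% figure as the fraction p of the original single-process runtime, and s as the speedup of that part on the GPU):

```
S(s) = \frac{1}{(1 - p) + p/s}, \qquad
S(20)\Big|_{p = 0.95} = \frac{1}{0.05 + 0.0475} \approx 10.3, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p} = 20.
```

With p = 0.95 and s = 20, the remaining CPU part (0.05) already costs slightly more than the accelerated GPU part (0.0475), and no GPU, however fast, can push the overall speedup past 20x.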

Using the CUDA profiler, I think I have found the reason for my MPI speedup: non-coalesced accesses by the GPU threads to memory, which are reduced when the number of CPU processes is increased.

Thank you again for your responses.

How would the performance impact of uncoalesced memory accesses be any different by running kernels in different processes compared with running them in the same context with streams?

I think that the title “CUDA streams vs CUDA+MPI” is no longer justified. The comparison between streams and MPI made sense from the point of view of kernel execution. However, with MPI I reduce the amount of uncoalesced access in the data treated per kernel (a fact I neglected before using the CUDA profiler), which was not the case with streams. In my current program, MPI and streaming play complementary roles: with MPI I reduce the uncoalesced accesses to the data, and with streams I saturate the GPU with as many instructions as possible.
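To illustrate what I mean by coalescing in general terms (these are generic example kernels, not the ones from my program):

```
// Generic illustration of coalesced vs. strided global memory access.
// On Fermi-class hardware such as the GTX 480, the strided version causes
// many more memory transactions per warp.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // consecutive threads read consecutive addresses
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    // 'in' must hold at least n * stride elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // threads in a warp touch addresses far apart
}
```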

Are you telling us that the CUDA+MPI version is faster because you optimized the kernels after trying the single-threaded version, and you had forgotten about those changes before posting the question here? In that case, I would sincerely recommend looking at some good revision control software to keep track of the changes to your code.

Oh no, I did not forget, but I neglected it: to some extent, I have always thought that giving more data to the GPU yields a better speedup than giving less data and launching more kernels. In the last example I treated, that was not the case, because the data of the big problem are badly distributed in memory whereas the data of the small problems are very well distributed. This is why I was not aware of this coalescing problem until I used the CUDA profiler; it was not easy to see, because I run a nonlinear Monte Carlo simulation with coupled paths.

The correct word is “underestimate”, not “neglect”.