CUDA streams vs CUDA+MPI: how do different CPU processes access the GPU?

Hi,

I tried CUDA with MPI on one node: using MPI, eight CPU processes launch the same task (a SIMD task) on the same GTX 480. The whole CUDA+MPI solution is faster than using only one CPU process with the GPU. So, instead of MPI, I tried another solution based on CUDA streams, but this solution does not work, because (I think) I completely saturate the shared memory even with only one stream; using several streams presumably requires enough memory for all of them.
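For reference, this is the general pattern I mean by “CUDA streams” (an illustrative sketch only, not my actual code; myKernel, NSTREAMS and CHUNK are made-up placeholders):

```
// Illustrative sketch: issue the same kernel on several CUDA streams so that
// transfers and kernels belonging to different chunks can overlap.
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder work
}

int main()
{
    const int NSTREAMS = 8;
    const int CHUNK    = 1 << 20;   // elements per stream

    cudaStream_t streams[NSTREAMS];
    float *d_data, *h_data;

    cudaMalloc(&d_data, NSTREAMS * CHUNK * sizeof(float));
    cudaMallocHost(&h_data, NSTREAMS * CHUNK * sizeof(float)); // pinned memory, required for async copies

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; ++s) {
        float *h = h_data + s * CHUNK;
        float *d = d_data + s * CHUNK;
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        myKernel<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}
```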

My questions are the following: why do I get a speedup (greater than 3x when using 8 processes instead of one) with CUDA+MPI, when it is not possible to get a similar speedup with one CPU process using several streams? And how do the eight CPU processes access the GPU?
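For context, the CUDA+MPI variant follows the usual pattern where each rank is a separate process with its own CUDA context on the same device (again an illustrative sketch with placeholder names, not the real code):

```
// Illustrative sketch: every MPI rank is a separate CPU process that creates
// its own CUDA context on device 0 and launches the same kernel on its own
// slice of the data. myKernel and N are placeholders.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder work
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;    // elements per rank
    cudaSetDevice(0);         // every rank uses the single GTX 480

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    // ... fill d_data with this rank's share of the problem ...

    myKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();

    // ... copy results back and combine them with MPI_Gather / MPI_Reduce ...
    printf("rank %d finished its slice\n", rank);

    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}
```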

I will be very grateful for your responses.

So the question really boils down to: why is running 8 separate contexts on one device simultaneously faster than running a single context with streams?

The sane answer is probably that it should never be the case (given that your streams implementation is not completely wrong). The thing that immediately springs to mind is that you might be doing this on an insane platform, like Windows Vista or 7. Otherwise it would be pretty hard to explain.

I use Linux, and I get a huge speedup with MPI (400%). I also managed to get a small speedup with streams (11%), but I still don’t understand how I can get such a large speedup with MPI. Do you know how different CPU processes access the GPU?

Are you sure that the GPU is the bottleneck and not the CPU?

Multiple processes access the GPU serially, with context-switching overhead, in a completely non-deterministic fashion, which is why your example is so counterintuitive. It suggests that something other than the GPU is the performance limit of your code, like single-threaded CPU performance or I/O.

The majority of the computations (95%) are done on the GPU. Besides, is there any chance that the GPU executes a kernel in less time than it takes the CPU to launch it?

Thank you again.

Very unlikely. On Linux, a kernel launch only takes about 15 to 20 microseconds. Context establishment takes on the order of 100 milliseconds. So the sum of those is about the minimum overhead per process.
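If you want to check the launch overhead on your own setup, a quick and rough way is to time a large number of empty kernel launches; something along these lines (illustrative sketch, the exact numbers depend on driver and hardware):

```
// Rough measurement of kernel launch overhead: time many launches of an
// empty kernel and divide by the count. This includes the execution of a
// trivial kernel, so treat the result as an upper bound on launch cost.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int LAUNCHES = 10000;

    emptyKernel<<<1, 1>>>();        // warm-up: pays the context creation cost
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < LAUNCHES; ++i)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average per launch: %.2f us\n", us / LAUNCHES);
    return 0;
}
```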

If I understand you correctly, and 95% of your computations are done on the GPU and the GPU is more than 20 times faster than the CPU, then the CPU indeed is the bottleneck. Check Amdahl’s law.
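As a back-of-the-envelope check (taking the 95% figure as the fraction p of the original single-process runtime, and s as the speedup of that part on the GPU):

```
S(s) = \frac{1}{(1 - p) + p/s}, \qquad
S(20)\Big|_{p = 0.95} = \frac{1}{0.05 + 0.0475} \approx 10.3, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p} = 20.
```

With p = 0.95 and s = 20, the remaining CPU part (0.05) already costs slightly more than the accelerated GPU part (0.0475), and no GPU, however fast, can push the overall speedup past 20x.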

Using the CUDA profiler, I think I have found the reason for my MPI speedup: non-coalesced accesses by the GPU threads to memory, which are reduced when the number of CPU processes is increased.

Thank you again for your responses.

How would the performance impact of uncoalesced memory accesses be any different by running kernels in different processes compared with running them in the same context with streams?

I think that the title “CUDA streams vs CUDA+MPI” is no longer justified. The comparison between streams and MPI made sense from the point of view of kernel execution. However, with MPI I reduce the amount of uncoalesced access in the data treated per kernel (a fact I neglected before using the CUDA profiler), which was not the case with streams. In my current program, MPI and streaming play complementary roles: with MPI I reduce the uncoalesced accesses to the data, and with streams I saturate the GPU with as many instructions as possible.
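To illustrate what I mean by coalescing in general terms (these are generic example kernels, not the ones from my program):

```
// Generic illustration of coalesced vs. strided global memory access.
// On Fermi-class hardware such as the GTX 480, the strided version causes
// many more memory transactions per warp.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // consecutive threads read consecutive addresses
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    // 'in' must hold at least n * stride elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // threads in a warp touch addresses far apart
}
```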

Are you telling us that the CUDA+MPI version is faster because you optimized the kernels after trying the single-threaded version, and you had forgotten about those changes before posting the question here? In that case, I would sincerely recommend looking at some good revision control software to keep track of the changes to your code.

Oh no, I did not forget, but I neglected it: to some extent, I have always thought that giving more data to the GPU yields a better speedup than giving less data and launching more kernels. In the last example I treated, that was not the case, because the data of the big problem are badly distributed in memory whereas the data of the small problems are very well distributed. This is why I was not aware of this coalescing problem until I used the CUDA profiler; it was not easy to see, because I run a nonlinear Monte Carlo simulation with coupled paths.

The correct word is “underestimate”, not “neglect”.