Benchmarking CUDA-Aware MPI

Originally published at:

I introduced CUDA-aware MPI in my last post, with an introduction to MPI and a description of the functionality and benefits of CUDA-aware MPI. In this post I will demonstrate the performance of MPI through both synthetic and realistic benchmarks. Synthetic MPI Benchmark results Since you now know why CUDA-aware MPI is more efficient from…

Is this code compatible with the graphics cores in a cluster of Jetson TK1s?

Hi Spencer, yes, the Jacobi solver should run on a Tegra K1 if you can get a working CUDA-aware MPI there. I would not expect issues with compiling OpenMPI or MVAPICH2 on the Tegra K1, but I have not tested this myself. One change you should make is to set USE_FLOAT in Jacobi.h to 1. The code would also work in double precision, but more slowly. Also, please be aware that no variant of GPUDirect is available on a cluster of Tegra K1s. Jiri

Hi Jiri, thanks for responding. I was able to get the code to compile. I have 40 Jetson TK1s, so would it be best to run this code with mpirun -np 40 ./jacobi_cuda_aware_mpi -t 40 1, or with the number of cores the Jetsons have, i.e. mpirun -np 160 ./jacobi_cuda_aware_mpi -t 160 1? Also, would it be best to compile with GENCODE 30, since the Jetson supports 3.2? In addition, running jacobi_cuda_aware_mpi gives me the error "The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol cannot be used." Is this what you were referring to when you said GPUDirect is not available?

Hi Spencer,
good to hear that you could get it to compile. To answer your questions:
1. It's most efficient to start one MPI process per GPU, i.e. one process per Jetson TK1, which is 40 in your case. You probably want to try different process layouts. For example, try an 8 x 5 layout with mpirun -np 40 ./jacobi_cuda_aware_mpi -t 8 5.
2. You should compile for compute capability 3.2 by adding -gencode arch=compute_32,code=sm_32 to the Makefile.
3. You are right: you get the error because GPUDirect is not available. If you are using OpenMPI, you can disable CUDA IPC (GPUDirect P2P between processes) by passing --mca btl_smcuda_use_cuda_ipc 0 as an argument to mpirun to avoid that error.
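
As a rough aid for trying different process layouts, the following host-only C++ sketch lists every x by y layout whose product equals the number of MPI processes. The helper name candidate_layouts is hypothetical and not part of the Jacobi code; each resulting pair would be passed as the two values after -t, as in the 8 x 5 example above.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the Jacobi code): list every
// (x, y) process layout whose product equals the number of ranks.
std::vector<std::pair<int, int>> candidate_layouts(int num_ranks) {
    assert(num_ranks > 0);
    std::vector<std::pair<int, int>> layouts;
    for (int x = 1; x <= num_ranks; ++x) {
        if (num_ranks % x == 0) {
            // x divides num_ranks, so x * (num_ranks / x) == num_ranks.
            layouts.emplace_back(x, num_ranks / x);
        }
    }
    return layouts;
}
```

For 40 processes this yields eight layouts, from 1 x 40 through 8 x 5 to 40 x 1. Which one performs best has to be measured on the actual cluster.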

Thanks again.
Yes, I am running OpenMPI 1.10.2, and I was able to run the code. The cuda_aware version with CUDA IPC disabled achieves about 4 times fewer GFLOPS than the cuda_normal code. I was expecting it to perform better. I was reading about using zero-copy to replace GPUDirect, because the Jetson TK1 is a shared-memory system. In this case both codes use CUDA, but CUDA_AWARE uses CUDA with MPI, and CUDA_NORMAL uses CUDA on each host. Is this correct?

Hi Spencer, it is indeed unexpected that the CUDA-aware version performs so much worse than the regular MPI version. However, as you point out, using zero-copy memory and a regular MPI will be the more efficient thing to do on the Jetsons anyway. Regarding the difference between the two versions of the code: the CUDA-aware version passes device pointers directly into MPI and lets the CUDA-aware MPI handle the data transport. The regular MPI version uses cudaMemcpy to stage device memory to host memory before passing it into MPI. When you switch to zero-copy, you would use a regular MPI and would not do any staging with cudaMemcpy. Jiri

Hello, I'm running the code on a single Tesla K40 card with the default size (4096x4096), and it gives me only 2.53 GLU/s in double precision. If I run it in single precision I get twice that, but I cannot reproduce the first point of the weak scalability graph. Is there something else I have to take into account? Regards

Hi Ferche, did you pass in the command line option for fast swap (-fs)? Sorry that this is not explicitly mentioned in the post above.

Hi Jiri, thanks for your reply. With that option I now reach 7.68 GLU/s on average in single precision, which is closer to the benchmark.
In double precision I get an average of 5.17 GLU/s. Is the benchmark in single precision?
We are configuring a cluster of GPUs, and we are using this benchmark to check the scalability and make sure that the cluster is configured correctly.
We are running on 4, 8 and 16 GPUs, but we are far from hitting the benchmark numbers. I think we are doing something wrong.
If you can help us, it would be very valuable. I can leave you my e-mail if you wish.


To validate your cluster setup, I would recommend that you start off with some MPI micro-benchmarks and check whether you are hitting the expected inter-GPU bandwidths and latencies (both intra-node and inter-node), e.g. the OSU ones: http://mvapich.cse.ohio-sta...
Regarding the Jacobi benchmark described in this blog post, one reason why you might not see the expected behavior is that GPU affinity is not handled correctly. Did you make sure that the macro ENV_LOCAL_RANK in src/Jacobi.h matches the environment variable exported by your MPI launcher? It defaults to MV2_COMM_WORLD_LOCAL_RANK and would need to be changed to OMPI_COMM_WORLD_LOCAL_RANK in the case of OpenMPI, or to something different depending on your MPI launcher. Last but not least, I would also like to point you to another GitHub repository with some multi-GPU example codes, and a recording of a GTC talk on Multi GPU Programming with MPI: http://on-demand-gtc.gputec...
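
To illustrate the GPU affinity point, here is a minimal host-only C++ sketch of how a process might derive its GPU from the launcher's local-rank variable. The function names are hypothetical, only the two environment variables named above are checked, and the round-robin mapping is an assumption rather than a description of what the Jacobi code does.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: return the node-local rank exported by the MPI
// launcher, or -1 if neither variable is set. Only the two variable
// names mentioned in the thread are checked.
int local_rank_from_env() {
    const char* names[] = {"MV2_COMM_WORLD_LOCAL_RANK",
                           "OMPI_COMM_WORLD_LOCAL_RANK"};
    for (const char* name : names) {
        const char* value = std::getenv(name);
        if (value != nullptr) {
            return std::atoi(value);
        }
    }
    return -1;
}

// Assumed round-robin mapping of local ranks onto the GPUs of a node.
// In a CUDA program the result would be passed to cudaSetDevice().
int device_for_local_rank(int local_rank, int num_devices) {
    assert(local_rank >= 0 && num_devices > 0);
    return local_rank % num_devices;
}
```

If the lookup returns -1, the launcher exports a different variable name, and every process on a node would likely end up on the same GPU unless the name is adjusted.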

Thanks again Jiri, your help has been very valuable! Regards

Good article! I have some questions related to CUDA-aware MPI. Consider this common pattern: when one sends/receives messages, on the sender side one has to gather data from local data structures and put it in a send buffer (i.e., pack); on the receiver side, one has to scatter (i.e., unpack) data from the recv buffer to local data structures.
a) On communication / computation overlapping
On the sender side, my practice is: pack data for one neighbor, then MPI_Isend it, then pack for the next neighbor. That way packing and sending appear to overlap. If I use CUDA-aware MPI, the pack routine is going to be a CUDA kernel. I can either
1) Call one kernel to pack data for all neighbors, then call multiple MPI_Isends; OR
2) Call multiple kernels and interleave pack, Isend, pack, Isend.
Method 1 uses fewer kernels, but does not do comm / comp overlapping, while method 2 does. So, which one is better? Is comm. / comp. overlapping worthwhile on GPU?

b) On atomic operation
On the receiver side, the received data occasionally has to be reduced into the user's data. At runtime, some entries (not all) in the receive buffer have the same destination -- they are reduced to the same location in the user's data structure (e.g., ADD, MAX, etc). Are there tips to implement this unpack kernel efficiently?


Hi Junchao,

thanks for your comments. These are good questions. Let me try to answer them:
a) On communication / computation overlapping: Ultimately, the best strategy will depend on multiple factors. In my experience, a generally good strategy is to first asynchronously launch CUDA kernels operating on data not involved in communication and then do the MPI communication synchronously (this might be a series of MPI_Isend + MPI_Irecv + MPI_Waitall) while the CUDA kernel runs. In the case of non-contiguous communication buffers, a possible strategy would be to first launch one packing kernel in stream A, then launch the kernel processing the interior of the domain in stream B, wait for stream A to finish, and do all MPI communication while the kernel in stream B is still running. That way you do not get overlap between packing and communication, but in many cases you can still hide communication times and avoid launch overheads for many small packing kernels. On this topic you might also want to look at GPU-side datatype processing, as e.g. MVAPICH2-GDR supports it: http://mvapich.cse.ohio-sta...
b) On atomic operation: I am sorry, I don't think I understand your question sufficiently. Is your use case covered by a collective implemented in CUDA-aware MPI or NCCL? NCCL is interoperable with MPI, so it might be a good option for you.

Hope this helps


Thanks for the info. By atomic operations, I meant this common code pattern: once MPI ranks have received their data in their receive buffer, they need to "reduce" the data into their local data structure (for example, an array u[]). Let's say we need to do u[idx[i]] += buf[i]. The index array idx[] is given, but its content is runtime dependent. If there is a chance that idx[i] = idx[j], the ADD operation has to be atomic to avoid a data race. I can blindly use atomics even if in reality there are no conflicts, or I can analyze the idx[] array to test whether there are conflicts and dispatch to one of two code paths: one without conflicts using regular operations, and the other with conflicts using atomics. Of course, the latter will complicate the code.
I was told today that as long as the data is in global memory, atomics are as fast as regular instructions when there are no conflicts. So I prefer using atomics blindly in this case.
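
For reference, the conflict analysis on idx[] and the update u[idx[i]] += buf[i] described above can be sketched in host-only C++ as follows. The function names are hypothetical, and this sequential version only illustrates the logic. In a CUDA unpack kernel, each update would instead be an atomicAdd on the global-memory array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side check: true if any destination index occurs
// more than once, i.e. a parallel u[idx[i]] += buf[i] would race
// unless the addition is atomic.
bool has_conflicts(const std::vector<int>& idx, std::size_t u_size) {
    std::vector<bool> seen(u_size, false);
    for (int target : idx) {
        assert(target >= 0 && static_cast<std::size_t>(target) < u_size);
        if (seen[target]) {
            return true;
        }
        seen[target] = true;
    }
    return false;
}

// Sequential reference for the unpack step. In a CUDA kernel each
// update would become atomicAdd(&u[idx[i]], buf[i]).
void scatter_add(std::vector<double>& u, const std::vector<int>& idx,
                 const std::vector<double>& buf) {
    assert(idx.size() == buf.size());
    for (std::size_t i = 0; i < buf.size(); ++i) {
        u[idx[i]] += buf[i];
    }
}
```

The check costs an extra pass over idx[] plus a second code path, which is the added complexity mentioned above.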

Thanks for clarifying. So you are referring to a completely local operation. I was confused because I read your statement in the context of MPI. In this case I agree with your conclusion. It is very likely that just using global memory atomic operations will give you the best performance (compared with other, more complicated approaches like sorting the index array).