GTC 2020: Multi-GPU Programming with Message-Passing Interface

GTC 2020 S21067
Presenter: Jiri Kraus, NVIDIA
Learn how to program multi-GPU systems or GPU clusters using the message-passing interface (MPI) and OpenACC or NVIDIA CUDA. We’ll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we’ll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We’ll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service, and MPI support in NVIDIA performance analysis tools.


Hi Jiri, thanks for the great video. I have a question regarding implementation with CUDA-aware MPI.

As per the Open MPI FAQ page, they suggest that you set the device prior to calling MPI_Init. Is that outdated information? If not, how do we go about doing that when we determine the GPU using the local rank?


Hi Todd,

thanks for the feedback on my talk. Regarding your question: yes, for most implementations that is outdated information; it is no longer necessary to set the device before calling MPI_Init. However, the device should still be selected before calling MPI routines that require a GPU context, and it should not be changed after that. Exactly which MPI routines require a GPU context depends on your MPI implementation. A mental model that works for me is: all MPI communication routines that accept a pointer or reference to data that should be communicated need a GPU context. Creating a new communicator, e.g. with MPI_Comm_split_type, and querying the size or rank of a communicator do not require a GPU context and can therefore be used before selecting a device.
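To illustrate, here is a minimal sketch of that ordering (one GPU per local rank is assumed, and error checking is omitted for brevity; exact behavior depends on your MPI implementation):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Creating a communicator and querying its rank does not need a
       GPU context, so it is safe to do this before selecting a device. */
    MPI_Comm local_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);
    int local_rank;
    MPI_Comm_rank(local_comm, &local_rank);

    /* Select the device before any MPI routine that touches
       communication buffers, and do not change it afterwards. */
    int num_devices;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    /* From here on, communication routines (MPI_Isend, MPI_Irecv, ...)
       may be called with device pointers. */

    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}
```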

However, I think the FAQ entry you are referring to is “10. What are some guidelines for using CUDA and Open MPI with Omni-Path?”, which is specific to Omni-Path. So if you are using the GPUDirect RDMA support of Omni-Path, it is probably still required to select a device before calling MPI_Init. I will try to get clarification on that.

If not, how do we go about doing that when we determine the GPU using the local rank?

You can use getenv and query an environment variable set to the local rank on a node. All launchers I am aware of set a variable like this, e.g. the Open MPI launcher sets OMPI_COMM_WORLD_LOCAL_RANK and the MVAPICH2 launcher sets MV2_COMM_WORLD_LOCAL_RANK.

I hope this helps.




Thanks for the response, Jiri. That answered my question.

I did have one more question regarding MPI datatypes with CUDA-aware MPI. I know this is probably implementation-specific, but I was wondering about your thoughts on using MPI_Type_indexed for communicating between GPUs.

Would it be better to pack/unpack my own buffers on the device and pass those to Isend/Irecv (as opposed to using MPI_Type_indexed on a GPU data pointer)?

Hi Todd,

as you say, support for MPI datatype processing with CUDA-aware MPI is implementation-specific; e.g. MVAPICH2-GDR has support for GPU-side packing. However, besides the varying support in different CUDA-aware MPI implementations, there are also some general considerations. MPI-internal packing of data can be more efficiently pipelined with inter-GPU data movement via the network or between nodes. However, it requires the MPI implementation to launch CUDA kernels, and your application has no control over when they are launched and in which streams. If you are overlapping MPI communication with kernel execution, that can cause performance issues, so you might be better off doing the packing on the application side.
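For the application-side option, here is a CPU-side sketch of the packing step (the function name and layout are mine for illustration; on the GPU the same loop becomes a trivial CUDA kernel with one thread per element, launched on a stream the application controls, which is exactly the control you lose with MPI-internal datatype processing):

```c
#include <stddef.h>

/* Gather non-contiguous elements, described by block displacements
   and lengths as an MPI_Type_indexed would be, into a contiguous
   send buffer. Returns the number of elements packed, which would
   be passed as the count to MPI_Isend. */
static size_t pack_indexed(const double *src,
                           const int *displs, const int *blocklens,
                           int nblocks, double *dst)
{
    size_t n = 0;
    for (int b = 0; b < nblocks; ++b)
        for (int i = 0; i < blocklens[b]; ++i)
            dst[n++] = src[displs[b] + i];
    return n;
}
```

The receiver would run the inverse loop to unpack; in the real code both buffers live in device memory and are passed directly to Isend/Irecv with a CUDA-aware MPI.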

Hope this helps.



Thanks, Jiri, that helps a lot.

Thanks again for taking the time to answer my questions.

Excited by the mention of WRF (Weather Research & Forecasting) Model acceleration with OpenACC directives in the newly released NVIDIA HPC SDK, I am seeking advice or direction toward a collaboration. Would you perhaps know where I could ask more about this?

Thanks, Bennet, for reaching out to us. A colleague of mine will follow up with you offline.