Hi, I am trying to use OpenMPI to run containers, following the instructions here,
I have successfully built OpenMPI (3.1.4) with UCX (1.10) and CUDA (11.6) support. I am trying the following command to launch the job on 1 node with 2 NVIDIA A100 GPUs:
I expect it to map GPU0 to mpi_rank0 and GPU1 to mpi_rank1, just like common CUDA-aware MPI does. But checking nvidia-smi, it seems that the application in each MPI process somehow uses all GPUs, which is not ideal.
I have also tried to call MPI inside the container, as follows:
How is the application setting the rank-to-device binding?
Typically this is done in the program using cudaSetDevice if it is a CUDA program, or acc_set_device if it uses OpenACC. (What programming language is the application using?)
If the device setting is done after calling MPI_Init, OpenMPI will create a CUDA context on the default device, so nvidia-smi will show an extra process per rank. That is likely what’s happening here.
It is a bit odd that it appears the contexts are being created on device 1 and then both ranks are setting themselves to use device 0, but possible if the default device is 1.
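For reference, here is a minimal sketch of the usual one-rank-per-GPU pattern, assuming OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK to each process (the program below is illustrative, not your application):

```c
/* Minimal sketch: bind each rank to one GPU using the local rank,
 * assuming OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable.
 * The device is selected *before* MPI_Init so the CUDA context created
 * by the CUDA-aware transport lands on the intended GPU. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Local rank on this node, exported by the OpenMPI launcher. */
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(local_rank % ndev);  /* e.g. rank 0 -> GPU 0, rank 1 -> GPU 1 */

    MPI_Init(&argc, &argv);  /* CUDA context is now created on the bound device */

    int rank = 0, dev = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDevice(&dev);
    printf("rank %d bound to GPU %d\n", rank, dev);

    MPI_Finalize();
    return 0;
}
```

Selecting the device before MPI_Init this way avoids the extra per-rank context on the default device that you would otherwise see in nvidia-smi.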
Currently, they don’t support multiple processes per node. I’m just wondering, then, how intra-node GPUs can communicate with MPI, since they are all under one MPI process, i.e. the same rank.
You’re correct that it’s unlikely they’d be able to use CUDA-aware MPI. It’s possible they are using GPUDirect calls or NVSHMEM, but more likely they don’t do direct communication. Though I don’t know the application, so it’s probably best to ask the developers.