Assignment of MPI ranks to GPUs

Hi,

I work at the IBM T. J. Watson Research Center and I have noticed a strange phenomenon. If I run a program with 20 MPI ranks on a node with 4 Pascal GPUs and distribute the ranks evenly across the GPUs, then at run time nvidia-smi shows the following:

| 0 25103 C ./hacc_tpm 287MiB |
| 0 25104 C ./hacc_tpm 287MiB |
| 0 25105 C ./hacc_tpm 287MiB |
| 0 25106 C ./hacc_tpm 287MiB |
| 0 25107 C ./hacc_tpm 287MiB |
| 0 25108 C ./hacc_tpm 287MiB |
| 0 25110 C ./hacc_tpm 287MiB |
| 0 25111 C ./hacc_tpm 287MiB |
| 0 25112 C ./hacc_tpm 287MiB |
| 0 25113 C ./hacc_tpm 287MiB |
| 0 25114 C ./hacc_tpm 287MiB |
| 0 25115 C ./hacc_tpm 287MiB |
| 0 25116 C ./hacc_tpm 287MiB |
| 0 25117 C ./hacc_tpm 287MiB |
| 0 25121 C ./hacc_tpm 287MiB |
| 0 25126 C ./hacc_tpm 287MiB |
| 1 25104 C ./hacc_tpm 287MiB |
| 1 25108 C ./hacc_tpm 287MiB |
| 1 25113 C ./hacc_tpm 287MiB |
| 1 25117 C ./hacc_tpm 287MiB |
| 2 25105 C ./hacc_tpm 287MiB |
| 2 25110 C ./hacc_tpm 287MiB |
| 2 25114 C ./hacc_tpm 287MiB |
| 2 25121 C ./hacc_tpm 287MiB |
| 3 25106 C ./hacc_tpm 287MiB |
| 3 25111 C ./hacc_tpm 287MiB |
| 3 25115 C ./hacc_tpm 287MiB |
| 3 25126 C ./hacc_tpm 287MiB |

Apparently there are 32 MPI ranks in total, but checking the PIDs shows they are really only 20: the ranks assigned to GPUs 1, 2 and 3 also appear on GPU 0.

The only way I could solve this was by creating the following shell script (shell_script.sh):

#!/bin/bash
ngpus=4
if [[ -n ${OMPI_COMM_WORLD_LOCAL_RANK} ]]
then
    lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
    # Integer division groups consecutive local ranks onto the same device.
    device=$(( lrank / ngpus ))
    export CUDA_VISIBLE_DEVICES=$device
fi
echo $lrank $device $CUDA_VISIBLE_DEVICES
echo "$@"
"$@"
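To sanity-check the mapping without launching MPI, the wrapper's arithmetic can be run standalone. Note that the integer division assumes at most 4 ranks per GPU (i.e. no more than 16 local ranks with ngpus=4); beyond that the computed device index goes out of range. This is just a quick illustration of the arithmetic, not part of the original script:

```shell
#!/bin/bash
# Replay the wrapper's rank -> device computation for a few local ranks:
# with integer division by 4, ranks 0-3 -> GPU 0, 4-7 -> GPU 1, and so on.
ngpus=4
for lrank in 0 3 4 7 8 11 12 15; do
  device=$(( lrank / ngpus ))
  echo "local rank $lrank -> GPU $device"
done
```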

When I run the code I use the following command:

mpirun … shell_script.sh ./executable …

I would like to know if you are aware of this problem, because for CORAL we would like to create baseline versions of 13 DoE benchmark applications using OpenACC and the PGI compiler.

Best,
Fausto

Hi Fausto,

I would like to know if you are aware of this problem, because for CORAL we would like to create baseline versions of 13 DoE benchmark applications using OpenACC and the PGI compiler.

Yes, we are aware of this issue.

What’s happening is that some device initialization occurs while the binary is being loaded. Since the device number hasn’t been set yet at that point, the default device is used, which is why you see these extra entries. Note that there isn’t a performance penalty, but some memory is wasted on the default device.

Your solution of using CUDA_VISIBLE_DEVICES is what I typically recommend.
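For reference, a round-robin variant of such a wrapper (a hypothetical generalization, not from this thread) keeps the device index in range even when the number of local ranks is not a multiple of the GPU count, by using the local rank modulo the number of GPUs:

```shell
#!/bin/bash
# Hypothetical round-robin variant of shell_script.sh:
# local ranks 0,4,8,... -> GPU 0; 1,5,9,... -> GPU 1; etc.
ngpus=4
if [[ -n ${OMPI_COMM_WORLD_LOCAL_RANK} ]]
then
    lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
    device=$(( lrank % ngpus ))
    export CUDA_VISIBLE_DEVICES=$device
fi
exec "$@"   # exec avoids leaving an extra shell process per rank
```

Either way, each rank sees exactly one GPU, which it then addresses as device 0 inside the program.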

Best Regards,
Mat