Issue of Running OpenMPI on Multiple GPU Nodes with InfiniBand


For (3) you can check with a simple CPU application: have a look at the OSU Micro-Benchmarks (MVAPICH :: Benchmarks). Try with 2 processes on 2 nodes to check your InfiniBand and Slurm setup.
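A minimal sketch of such a check, assuming a site-provided `openmpi` module; the benchmark version, download URL pattern, and executable path inside the tarball are assumptions and may differ on your system:

```shell
# Sketch: run the OSU point-to-point latency test across two nodes.
module load openmpi   # site-specific module name (assumption)

# Build the OSU Micro-Benchmarks (version number is illustrative)
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.4.tar.gz
tar xf osu-micro-benchmarks-7.4.tar.gz
cd osu-micro-benchmarks-7.4
./configure CC=mpicc CXX=mpicxx && make

# Two ranks, one per node: small-message latency in the low-microsecond
# range suggests traffic goes over InfiniBand rather than TCP/Ethernet.
srun -N 2 -n 2 --ntasks-per-node=1 ./c/mpi/pt2pt/standard/osu_latency
```

If the reported latency looks like tens of microseconds or more, the job is likely falling back to Ethernet and the InfiniBand setup needs investigating.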

For (1), I’m unable to use the MPI flavor provided in the NVIDIA HPC SDK with Slurm on my local cluster, as I need to launch the code with srun so that the allocated GPUs are identified. I’m building my own version of OpenMPI with the NVIDIA compilers, but it is not fully operational yet (see Howto build OpenMPI with nvhpc/24.1 - #4 by patrick.begou).
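For reference, a configure sketch for such a build, assuming the nvhpc module exports `NVHPC_ROOT` and that UCX is installed under `/usr`; the install prefix, versions, and paths are assumptions to adapt to your site:

```shell
# Sketch: build OpenMPI with the NVIDIA (nvhpc) compilers,
# with Slurm, UCX (InfiniBand), and CUDA support enabled.
module load nvhpc/24.1          # site-specific module (assumption)

./configure CC=nvc CXX=nvc++ FC=nvfortran \
    --prefix=$HOME/opt/openmpi-nvhpc \
    --with-slurm \
    --with-ucx=/usr \
    --with-cuda=$NVHPC_ROOT/cuda
make -j 8 && make install
```

Enabling `--with-slurm` lets jobs be started directly with srun, which is what allows Slurm to hand each rank its allocated GPUs.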