I’m building a program with CUDA Fortran. The environment relies only on nvidia/hpcsdk/2023 (23.3), which I load through environment modules. I compile the source code with the mpif90 from the communication library bundled with the HPC SDK, and then submit the job to the computing nodes with a Slurm batch file:
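(A minimal sketch of what the batch file looks like; the job name, partition, and executable names are placeholders rather than my real values.)

```
#!/bin/bash
#SBATCH --job-name=cuda_fortran_test
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=2                # 2 nodes
#SBATCH --ntasks-per-node=4      # 4 MPI ranks per node -> 8 ranks in total
#SBATCH --cpus-per-task=1        # one CPU core per rank
#SBATCH --gres=gpu:4             # 4 A100 GPUs per node, one per rank

module load nvidia/hpcsdk/2023

mpirun -np 8 ./my_gpu_app        # placeholder executable name
```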
Here you can see that I launch 8 ranks on 2 nodes; each node handles 4 tasks with a matching number of CPU cores and A100 GPUs. The nodes are connected with ConnectX-5 adapters, and each node contains 4 A100-SXM4-40G GPUs. However, the job only prints warnings and does not begin computing even after a couple of hours. The log is attached below:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: g0165
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: g0165
Local device: mlx5_0
--------------------------------------------------------------------------
[g0165][[56167,1],4][../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[56167,1],5]
[g0164:164000] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[g0164:164000] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[g0164:164000] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
As you can see, Slurm assigns nodes g0164 and g0165 to the program, and the issue is possibly related to the NIC, i.e., InfiniBand. I assume there are three potential reasons:
(1) The NVIDIA HPC SDK is not installed properly, so its communication library cannot recognize the NIC in the cluster. In other words, on a Slurm system the job should be launched with srun instead of mpirun, and srun needs additional configuration;
(2) Using the mpirun command line alone cannot work properly; it should be accompanied by other options for the program to run well (see the sketch after this list for what I mean in (1) and (2));
(3) The computing nodes are not configured properly to cooperate with InfiniBand, such as the buffer sizes, etc.
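To be concrete about (1) and (2), the alternative launch lines I have in mind would look roughly like this; the PMI plugin and MCA options are guesses on my part, not something I have verified on this cluster:

```
# (1) launch through Slurm directly; the PMI plugin name depends on how
#     Slurm and Open MPI were built on the cluster
srun --mpi=pmix -N 2 -n 8 ./my_gpu_app

# (2) mpirun with an explicit transport selection instead of the defaults
mpirun -np 8 --mca pml ucx ./my_gpu_app
```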
This problem has bothered me for a long time. Are there any possible solutions to this issue? Many thanks!
For (3) you can check with a simple CPU application: have a look at MVAPICH :: Benchmarks. Try with 2 processes on 2 nodes to check your InfiniBand and Slurm setup.
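For example, with the OSU micro-benchmarks built (the binary path below is just an example), something like:

```
# one rank per node, two nodes: exercises the inter-node InfiniBand path
# inside a 2-node allocation (e.g. salloc -N 2 --ntasks-per-node=1):
srun -N 2 -n 2 ./osu_bw       # or: mpirun -np 2 --map-by node ./osu_bw
```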
For (1), I’m unable to use the MPI flavor provided in the NVIDIA HPC SDK with Slurm on my local cluster, as I need to launch the code with srun to identify the allocated GPUs. I’m building my own version of OpenMPI with the NVIDIA compilers, but it is still not fully operational at this time (see Howto build OpenMPI with nvhpc/24.1 - #4 by patrick.begou).
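For reference, the kind of configure line I’m experimenting with looks roughly like this; the install prefixes and PMI/UCX locations are specific to my cluster, so treat them as placeholders:

```
# build OpenMPI with the NVIDIA (nvhpc) compilers, Slurm/PMI and CUDA support
./configure CC=nvc CXX=nvc++ FC=nvfortran \
    --prefix=$HOME/opt/openmpi-nvhpc \
    --with-slurm --with-pmi=/usr \
    --with-cuda=$NVHPC_ROOT/cuda \
    --with-ucx=/usr
make -j 8 && make install
```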
I have tested the CPU version of my code, using the module openmpi/4.1.0_IB_gcc9.3 pre-installed on the system. When I submit the task to the nodes, it works properly, although it prints certain warning messages:
[g0177:73014] mca_base_component_repository_open: unable to open mca_btl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_btl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0177:73014] mca_base_component_repository_open: unable to open mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: g0177
Local device: mlx5_0
--------------------------------------------------------------------------
[g0177:73014] mca_base_component_repository_open: unable to open mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
I guess the nvidia/hpcsdk/2023 module I used previously is not installed in a way that cooperates properly with IB. Maybe further configuration is needed for the HPC SDK to work.
There is another interesting thing: when I take the same GPU program, built only with nvidia/hpcsdk/2023, and submit it with the same Slurm batch file to the NVIDIA V100 nodes, which are connected by Omni-Path, it works correctly with no errors and no warnings. This confuses me as well.
Based on the openib btl’s warnings, it may be that the NICs aren’t in the ACTIVE state. It is worth checking the output of ibv_devinfo and whether basic IB tests like ib_write_bw work between the two nodes.
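For instance (the device and host names below are taken from the log above; ib_write_bw comes from the perftest package, which may or may not be installed on your nodes):

```
# confirm the port state; look for "state: PORT_ACTIVE" and the link layer
ibv_devinfo -d mlx5_0

# RDMA write bandwidth between the two allocated nodes
ib_write_bw              # on g0164 (server side)
ib_write_bw g0164        # on g0165 (client side, pointing at the server)
```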
We think the problem might be due to the OpenMPI version in use. 23.3 shipped with OpenMPI 3.1.5 as the default, which doesn’t have preset parameters for device 4123 (ConnectX-6).
With 23.3 we also ship OpenMPI 4 (under “23.3/comm_libs/openmpi4/openmpi-4.0.5/”) as well as HPC-X (under “23.3/comm_libs/hpcx/hpcx-2.14/ompi/bin/”), which you can try instead.
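A minimal sketch of picking up one of the newer flavors; the install prefix and module name below are assumptions and depend on how the SDK was installed on your cluster:

```
# if a matching environment module exists, prefer that:
module load nvidia/hpcsdk/23.3-hpcx

# otherwise point PATH/LD_LIBRARY_PATH at the OpenMPI 4 build shipped with 23.3:
NVHPC=/opt/nvidia/hpc_sdk/Linux_x86_64            # install prefix (assumption)
export PATH=$NVHPC/23.3/comm_libs/openmpi4/openmpi-4.0.5/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC/23.3/comm_libs/openmpi4/openmpi-4.0.5/lib:$LD_LIBRARY_PATH

mpif90 --showme                                   # check which MPI is picked up
```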
Hi Mat, thank you very much for providing this valuable information. Indeed, when I use the module nvidia/hpcsdk/23.3-hpcx on multiple nodes, the program runs now, only throwing the following warning:
[1709879141.969011] [g0169:252454:0] ucp_context.c:1849 UCX WARN UCP API version is incompatible: required >= 1.15, actual 1.13.0 (loaded from /usr/lib/gcc/x86_64-redhat-linux/4.8.5//../../../../lib64/libucp.so.0)
I have a minor question: does this warning mean reduced communication performance if I do not update the version?
This warning means that UCX was loaded from the default system location (from the MLNX_OFED RPM) instead of from HPC-X. You’ll want to export LD_LIBRARY_PATH (and add -x LD_LIBRARY_PATH to mpirun) so that it points to the HPC-X location.
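Something along these lines; the HPC-X install path is an assumption and should be adjusted to where 23.3-hpcx lives on your cluster:

```
# point at the UCX that ships inside HPC-X rather than the system libucp
HPCX=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/comm_libs/hpcx/hpcx-2.14   # assumption
export LD_LIBRARY_PATH=$HPCX/ucx/lib:$LD_LIBRARY_PATH

# forward the environment to the remote ranks
mpirun -np 8 -x LD_LIBRARY_PATH ./my_gpu_app
```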