I’m building a program with CUDA Fortran. The environment relies only on nvidia/hpcsdk/2023 (23.3), which I load through environment modules. I compile the source code with the mpif90 from the communication library bundled with the HPC SDK, and then submit the job to the computing nodes with a Slurm batch file:
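(A minimal sketch of what the batch file looks like; the job name, partition, and executable names are placeholders rather than my real values.)

```
#!/bin/bash
#SBATCH --job-name=cuda_fortran_test
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=2                # 2 nodes
#SBATCH --ntasks-per-node=4      # 4 MPI ranks per node -> 8 ranks in total
#SBATCH --cpus-per-task=1        # one CPU core per rank
#SBATCH --gres=gpu:4             # 4 A100 GPUs per node, one per rank

module load nvidia/hpcsdk/2023

mpirun -np 8 ./my_gpu_app        # placeholder executable name
```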
Here you can see that I launch 8 ranks on 2 nodes; each node handles 4 tasks with a matching number of CPU cores and A100 GPUs. The nodes are connected with ConnectX-5 adapters, and each node contains 4 A100-SXM4-40G GPUs. However, the job only prints warnings and does not begin computing even after a couple of hours. The log is attached below:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: g0165
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: g0165
Local device: mlx5_0
--------------------------------------------------------------------------
[g0165][[56167,1],4][../../../../../opal/mca/btl/tcp/btl_tcp_endpoint.c:626:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[56167,1],5]
[g0164:164000] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[g0164:164000] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[g0164:164000] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
As you can see, Slurm assigns nodes g0164 and g0165 to the program, and the issue is possibly related to the NIC, i.e., InfiniBand. I assume there are three potential reasons:
(1) The NVIDIA HPC SDK is not installed properly, so its communication library cannot recognize the NIC in the cluster. In other words, on a Slurm system the job should be launched with srun instead of mpirun, and srun needs additional configuration;
(2) Using the mpirun command line alone cannot work properly; it should be accompanied by other options for the program to run well (see the sketch after this list for what I mean in (1) and (2));
(3) The computing nodes are not configured properly to cooperate with InfiniBand, such as the buffer sizes, etc.
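To be concrete about (1) and (2), the alternative launch lines I have in mind would look roughly like this; the PMI plugin and MCA options are guesses on my part, not something I have verified on this cluster:

```
# (1) launch through Slurm directly; the PMI plugin name depends on how
#     Slurm and Open MPI were built on the cluster
srun --mpi=pmix -N 2 -n 8 ./my_gpu_app

# (2) mpirun with an explicit transport selection instead of the defaults
mpirun -np 8 --mca pml ucx ./my_gpu_app
```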
This problem has bothered me for a long time. Are there any possible solutions to this issue? Many thanks!
For (3) you can check with a simple CPU application: have a look at MVAPICH :: Benchmarks. Try with 2 processes on 2 nodes to check your InfiniBand and Slurm setup.
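For example, with the OSU micro-benchmarks built (the binary path below is just an example), something like:

```
# one rank per node, two nodes: exercises the inter-node InfiniBand path
# inside a 2-node allocation (e.g. salloc -N 2 --ntasks-per-node=1):
srun -N 2 -n 2 ./osu_bw       # or: mpirun -np 2 --map-by node ./osu_bw
```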
For (1), I’m unable to use the MPI flavor provided in the NVIDIA HPC SDK with Slurm on my local cluster, as I need to launch the code with srun to identify the allocated GPUs. I’m building my own version of OpenMPI with the NVIDIA compilers, but it is still not fully operational at this time (see Howto build OpenMPI with nvhpc/24.1 - #4 by patrick.begou).
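For reference, the kind of configure line I’m experimenting with looks roughly like this; the install prefixes and PMI/UCX locations are specific to my cluster, so treat them as placeholders:

```
# build OpenMPI with the NVIDIA (nvhpc) compilers, Slurm/PMI and CUDA support
./configure CC=nvc CXX=nvc++ FC=nvfortran \
    --prefix=$HOME/opt/openmpi-nvhpc \
    --with-slurm --with-pmi=/usr \
    --with-cuda=$NVHPC_ROOT/cuda \
    --with-ucx=/usr
make -j 8 && make install
```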
I have tested the CPU version of my code, using the module openmpi/4.1.0_IB_gcc9.3 pre-installed on the system. When I submit the task to the nodes, it works properly, although it prints certain warning messages:
[g0177:73014] mca_base_component_repository_open: unable to open mca_btl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_btl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0177:73014] mca_base_component_repository_open: unable to open mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_btl_usnic: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: g0177
Local device: mlx5_0
--------------------------------------------------------------------------
[g0177:73014] mca_base_component_repository_open: unable to open mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
[g0178:120519] mca_base_component_repository_open: unable to open mca_mtl_ofi: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory (ignored)
I guess the nvidia/hpcsdk/2023 module I used previously is not installed in a way that cooperates properly with IB. Maybe further configuration is needed for the HPC SDK to work.
There is another interesting thing: when I take the same GPU program, built only with nvidia/hpcsdk/2023, and submit it with the same Slurm batch file to the NVIDIA V100 nodes, which are connected by Omni-Path, it works correctly with no errors and no warnings. This confuses me as well.
Based on the openib btl’s warnings, it may be that the NICs aren’t in the ACTIVE state. It is worth checking the output of ibv_devinfo and whether basic IB tests like ib_write_bw work between the two nodes.
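For instance (the device and host names below are taken from the log above; ib_write_bw comes from the perftest package, which may or may not be installed on your nodes):

```
# confirm the port state; look for "state: PORT_ACTIVE" and the link layer
ibv_devinfo -d mlx5_0

# RDMA write bandwidth between the two allocated nodes
ib_write_bw              # on g0164 (server side)
ib_write_bw g0164        # on g0165 (client side, pointing at the server)
```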
We think the problem might be due to the OpenMPI version in use. 23.3 shipped with OpenMPI 3.1.5 as the default, which doesn’t have preset parameters for device 4123 (ConnectX-6).
With 23.3 we also ship OpenMPI 4 (under “23.3/comm_libs/openmpi4/openmpi-4.0.5/”) as well as HPC-X (under “23.3/comm_libs/hpcx/hpcx-2.14/ompi/bin/”), which you can try instead.
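A minimal sketch of picking up one of the newer flavors; the install prefix and module name below are assumptions and depend on how the SDK was installed on your cluster:

```
# if a matching environment module exists, prefer that:
module load nvidia/hpcsdk/23.3-hpcx

# otherwise point PATH/LD_LIBRARY_PATH at the OpenMPI 4 build shipped with 23.3:
NVHPC=/opt/nvidia/hpc_sdk/Linux_x86_64            # install prefix (assumption)
export PATH=$NVHPC/23.3/comm_libs/openmpi4/openmpi-4.0.5/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC/23.3/comm_libs/openmpi4/openmpi-4.0.5/lib:$LD_LIBRARY_PATH

mpif90 --showme                                   # check which MPI is picked up
```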
Hi Mat, thank you very much for providing this valuable information. Indeed, when I use the module nvidia/hpcsdk/23.3-hpcx on multiple nodes, the program runs now, only throwing the following warning:
[1709879141.969011] [g0169:252454:0] ucp_context.c:1849 UCX WARN UCP API version is incompatible: required >= 1.15, actual 1.13.0 (loaded from /usr/lib/gcc/x86_64-redhat-linux/4.8.5//../../../../lib64/libucp.so.0)
I have a minor question: does this warning mean reduced communication performance if I do not update the version?
This warning means that UCX was loaded from the default system location (from the MLNX_OFED RPM) instead of from HPC-X. You’ll want to export LD_LIBRARY_PATH (and add -x LD_LIBRARY_PATH to mpirun) so that it points to the HPC-X location.
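Something along these lines; the HPC-X install path is an assumption and should be adjusted to where 23.3-hpcx lives on your cluster:

```
# point at the UCX that ships inside HPC-X rather than the system libucp
HPCX=/opt/nvidia/hpc_sdk/Linux_x86_64/23.3/comm_libs/hpcx/hpcx-2.14   # assumption
export LD_LIBRARY_PATH=$HPCX/ucx/lib:$LD_LIBRARY_PATH

# forward the environment to the remote ranks
mpirun -np 8 -x LD_LIBRARY_PATH ./my_gpu_app
```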