Building and Running Applications with HPC-X

Edit:

I fixed the issue below by switching to hpcx-2.6, which supports UCX v1.8. Now I can run the hello_c example:

mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c

mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

Hello, world, I am 1 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)

Hello, world, I am 0 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)

oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

Hello, world, I am 0 of 2: http://www.open-mpi.org/ (version: 1.4)

Hello, world, I am 1 of 2: http://www.open-mpi.org/ (version: 1.4)
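A quick way to confirm the fix took effect is to check which UCX the loaded environment actually provides. This is a hedged sketch: it assumes `hpcx_load` has been run so that `ucx_info` (shipped with HPC-X's UCX) is on the PATH; the expected 1.8 version comes from the edit above.

```shell
# hpcx-2.6 ships UCX 1.8 (per the edit above) -- that is what we expect to see.
EXPECTED_UCX=1.8

if command -v ucx_info >/dev/null 2>&1; then
  # First line of "ucx_info -v" reports the UCX library version in use.
  ucx_info -v | head -n 1
else
  echo "ucx_info not in PATH -- source hpcx-init.sh and run hpcx_load first"
fi
```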

Original post:

Hi,

I am following the example provided in the hpcx documentation

I am receiving the following error. I compiled the ucx-1.10 provided in the sources that came with HPC-X and still get the same error.

ibdiagnet report summary:


Summary

-I- Stage                  Warnings  Errors  Comment
-I- Discovery              0         0
-I- Lids Check             0         0
-I- Links Check            0         0
-I- Subnet Manager         0         0
-I- Port Counters          0         0
-I- Nodes Information      0         0
-I- Speed / Width checks   0         0
-I- Alias GUIDs            0         0
-I- Virtualization         0         0
-I- Partition Keys         0         0
-I- Temperature Sensing    0         0

-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log

-I- ibdiagnet database file : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv

-I- LST file : /var/tmp/ibdiagnet2/ibdiagnet2.lst

-I- Network dump file : /var/tmp/ibdiagnet2/ibdiagnet2.net_dump

-I- Subnet Manager file : /var/tmp/ibdiagnet2/ibdiagnet2.sm

-I- Ports Counters file : /var/tmp/ibdiagnet2/ibdiagnet2.pm

-I- Nodes Information file : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info

-I- Alias guids file : /var/tmp/ibdiagnet2/ibdiagnet2.aguid

-I- VPorts file : /var/tmp/ibdiagnet2/ibdiagnet2.vports

-I- VPorts Pkey file : /var/tmp/ibdiagnet2/ibdiagnet2.vports_pkey

-I- Partition keys file : /var/tmp/ibdiagnet2/ibdiagnet2.pkey

Any help to get me started on being able to run mpirun across all my nodes would be greatly appreciated.

Cheers

(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ source hpcx-init.sh

(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ hpcx_load

(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ env | grep HPCX

HPCX_HCOLL_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/hcoll

HPCX_CLUSTERKIT_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/clusterkit

HPCX_OSU_CUDA_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2-cuda

HPCX_OSU_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2

HPCX_MPI_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi

HPCX_OSHMEM_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi

HPCX_HOME=/home/user/infiniband/hpcx

HPCX_UCX_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx

HPCX_IPM_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/ipm-2.0.6

HPCX_SHARP_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sharp

HPCX_IPM_LIB=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/ipm-2.0.6/lib/libipm.so

HPCX_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64

HPCX_NCCL_RDMA_SHARP_PLUGIN_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/nccl_rdma_sharp_plugin

HPCX_MPI_TESTS_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests

(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c

(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

[1642388680.640057] [oak-rd0-linux:2429918:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)

[1642388680.640358] [oak-rd0-linux:2429919:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)

[1642388680.680581] [oak-rd0-linux:2429918:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)

[1642388680.681228] [oak-rd0-linux:2429919:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)

/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/examples/hello_c: symbol lookup error: /home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/lib/openmpi/mca_pml_ucx.so: undefined symbol: ucp_tag_recv_nbx

/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/examples/hello_c: symbol lookup error: /home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/lib/openmpi/mca_pml_ucx.so: undefined symbol: ucp_tag_recv_nbx


Primary job terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.



mpirun detected that one or more processes exited with non-zero status, thus causing

the job to be terminated. The first process to do so was:

Process name: [[6258,1],0]

Exit code: 127
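The symbol lookup error above means the UCX PML plugin is resolving a libucp that does not export `ucp_tag_recv_nbx` (the log shows it needs UCX 1.10 but finds 1.8). A hedged way to see which library the dynamic linker actually picks, under the paths from this post (the fallback path is an assumption):

```shell
# Path of the Open MPI UCX PML plugin; taken from the error message above.
PML=${HPCX_MPI_DIR:-$HOME/infiniband/hpcx/ompi}/lib/openmpi/mca_pml_ucx.so

if [ -f "$PML" ]; then
  # Which libucp.so.0 does the dynamic linker choose for this plugin?
  ldd "$PML" | grep libucp
  # Does that library export the symbol the plugin needs?
  LIB=$(ldd "$PML" | awk '/libucp/ {print $3}')
  if [ -n "$LIB" ] && nm -D "$LIB" | grep -q ucp_tag_recv_nbx; then
    echo "ucp_tag_recv_nbx found -- this libucp is new enough"
  else
    echo "ucp_tag_recv_nbx missing -- an older system libucp is shadowing HPC-X's"
  fi
else
  echo "mca_pml_ucx.so not found at $PML" >&2
fi
```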

Hello,

When an application is compiled against a specific version of MPI, the same version of MPI must be present on the compute node to run the application. In your case, the matching UCX must be there as well.

Looking at the output, it seems the host is not finding the required UCX version: it is looking for v1.10, which ships with HPC-X, but finding the v1.8 that is installed under /lib64. Try following the steps from the HPC-X user manual to compile and run the application, and if possible do not run as the root user.

One note: when running mpirun, could you double-check that the HPC-X environment was initialized correctly (export HPCX_HOME; source $HPCX_HOME/hpcx-init.sh; hpcx_load) and check whether adding '-x LD_LIBRARY_PATH' helps.
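Put together, the initialization sequence looks like the sketch below. The HPCX_HOME path is the one from this post; substitute your own extraction directory.

```shell
# Point HPCX_HOME at the extracted HPC-X tree (path from this post; adjust).
export HPCX_HOME=${HPCX_HOME:-$HOME/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64}

if [ -f "$HPCX_HOME/hpcx-init.sh" ]; then
  . "$HPCX_HOME/hpcx-init.sh"   # defines hpcx_load / hpcx_unload
  hpcx_load                     # prepends HPC-X's MPI and UCX to PATH / LD_LIBRARY_PATH
  # -x forwards the variable to the remote ranks, so every node resolves
  # the same libucp as the launch node:
  mpirun -x LD_LIBRARY_PATH -np 2 "$HPCX_MPI_TESTS_DIR/examples/hello_c"
else
  echo "hpcx-init.sh not found under $HPCX_HOME" >&2
fi
```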

Best Regards,

Viki

Hi Viki,

update:

I have now installed on all 3 of my nodes (Ubuntu 20.04, 5.4.0-26-generic kernel):

MLNX_OFED_LINUX-5.0-2.1.8.0-ubuntu20.04-x86_64

and

hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64

The MPI tests work fine on each individual node. My current issue is being able to run mpirun from my main node (with opensm enabled on that node) and use the CPUs in the other nodes.

Example :

I am on oak-rd0-linux (main node), opensm is running, ibdiagnet does not report any warnings or errors, and I am trying to test using the CPUs on oak-rd1-linux (host1) and oak-rd2-linux (host2) with:

mpirun -x LD_LIBRARY_PATH -np 2 -H oak-rd1-linux,oak-rd2-linux $HPCX_MPI_TESTS_DIR/examples/hello_c

Nothing happens - it seems to hang and I am not sure where to go from here. What am I doing wrong at this step, and what can I check to identify the problem?
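A hang at launch usually means the remote ranks never start, so it helps to separate the launcher from the fabric. This is a hedged debugging checklist, not a definitive diagnosis; the hostnames are the ones above, and the mpirun options are standard Open MPI 4.x flags.

```shell
# Hosts from the post above.
HOSTS=oak-rd1-linux,oak-rd2-linux

if ! command -v mpirun >/dev/null 2>&1; then
  echo "mpirun not in PATH -- run hpcx_load first" >&2
else
  # Step 1: launch a non-MPI command. If even this hangs, the problem is
  # the launcher (typically passwordless ssh or PATH on the remote nodes),
  # not InfiniBand or UCX.
  mpirun -np 2 -H "$HOSTS" hostname

  # Step 2: verbose launcher output shows where the startup stalls.
  mpirun --mca plm_base_verbose 10 -np 2 -H "$HOSTS" hostname

  # Step 3: forward the HPC-X environment so the remote ranks find the
  # same Open MPI and UCX as the launch node.
  mpirun -x PATH -x LD_LIBRARY_PATH -np 2 -H "$HOSTS" \
      "$HPCX_MPI_TESTS_DIR/examples/hello_c"
fi
```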

sudo ibnetdiscover output:

Topology file: generated on Tue Jan 25 16:31:42 2022

Initiated from node 0010e00001885688 port 0010e0000188568a

vendid=0x2c9

devid=0xc738

sysimgguid=0xe41d2d0300b39ee0

switchguid=0xe41d2d0300b39ee0(e41d2d0300b39ee0)

Switch 12 "S-e41d2d0300b39ee0" # "SwitchX - Mellanox Technologies" base port 0 lid 3 lmc 0

[1] "H-0010e00001885688"2 # "oak-rd0-linux HCA-1" lid 1 4xQDR

[2] "H-0010e000018d08e0"1 # "oak-rd1-linux HCA-1" lid 4 4xQDR

[3] "H-0010e00001885908"1 # "oak-rd2-linux HCA-1" lid 2 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e0000188590b

caguid=0x10e00001885908

Ca 2 "H-0010e00001885908" # "oak-rd2-linux HCA-1"

1 "S-e41d2d0300b39ee0"[3] # lid 2 lmc 0 "SwitchX - Mellanox Technologies" lid 3 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e000018d08e3

caguid=0x10e000018d08e0

Ca 2 "H-0010e000018d08e0" # "oak-rd1-linux HCA-1"

1 "S-e41d2d0300b39ee0"[2] # lid 4 lmc 0 "SwitchX - Mellanox Technologies" lid 3 4xQDR

vendid=0x2c9

devid=0x1003

sysimgguid=0x10e0000188568b

caguid=0x10e00001885688

Ca 2 "H-0010e00001885688" # "oak-rd0-linux HCA-1"

2 "S-e41d2d0300b39ee0"[1] # lid 1 lmc 0 "SwitchX - Mellanox Technologies" lid 3 4xQDR