Edit:
I fixed the issue below by switching to hpcx-2.6, which supports UCX v1.8. The root cause was a version mismatch: the Open MPI bundled with HPC-X 2.8.1 requires UCX 1.10, but at runtime the loader was resolving the system's libucp.so.0 from UCX 1.8, which lacks newer symbols such as ucp_tag_recv_nbx. With hpcx-2.6 the hello_c example runs:
mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
Hello, world, I am 1 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)
Hello, world, I am 0 of 2, (Open MPI v4.0.3rc4, package: Open MPI root@0e5a40994726 Distribution, ident: 4.0.3rc4, repo rev: v4.0.3rc4-6-g8b4a8cd34c, Unreleased developer copy, 148)
oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
Hello, world, I am 0 of 2: http://www.open-mpi.org/ (version: 1.4)
Hello, world, I am 1 of 2: http://www.open-mpi.org/ (version: 1.4)
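For anyone who hits the same thing: before downgrading, you can confirm which UCX the MPI stack actually resolves at runtime. A quick check (the paths follow my install from the log below; adjust to yours):

# UCX version of the first libucp the loader finds
ucx_info -v

# which libucp.so.0 the Open MPI UCX PML is actually resolved against
ldd $HPCX_MPI_DIR/lib/openmpi/mca_pml_ucx.so | grep libucp

In my case the second command pointed at the system's libucp.so.0 (UCX 1.8) rather than the 1.10 bundled under $HPCX_UCX_DIR, which matches the "required: 1.10, actual: 1.8" warnings below.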
Original post:
Hi,
I am following the example provided in the HPC-X documentation, but I am receiving the error shown at the bottom of this post. I also tried compiling the ucx-1.10 sources that come bundled with HPC-X, and I still get the same error.
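In case it matters, this is roughly how I built that UCX (the standard autotools flow; the tarball name and install prefix here are just placeholders for what I used):

tar xzf ucx-1.10.0.tar.gz
cd ucx-1.10.0
# release tarballs ship a generated ./configure, no autogen step needed
./configure --prefix=$HOME/infiniband/ucx-1.10
make -j
make install
# point the loader at the new build before running mpirun
export LD_LIBRARY_PATH=$HOME/infiniband/ucx-1.10/lib:$LD_LIBRARY_PATH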
The ibdiagnet report summary shows no warnings or errors, so the fabric itself looks healthy:
Summary
-I- Stage                    Warnings   Errors   Comment
-I- Discovery                0          0
-I- Lids Check               0          0
-I- Links Check              0          0
-I- Subnet Manager           0          0
-I- Port Counters            0          0
-I- Nodes Information        0          0
-I- Speed / Width checks     0          0
-I- Alias GUIDs              0          0
-I- Virtualization           0          0
-I- Partition Keys           0          0
-I- Temperature Sensing      0          0
-I- You can find detailed errors/warnings in: /var/tmp/ibdiagnet2/ibdiagnet2.log
-I- ibdiagnet database file : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST file : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Network dump file : /var/tmp/ibdiagnet2/ibdiagnet2.net_dump
-I- Subnet Manager file : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters file : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- Nodes Information file : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- Alias guids file : /var/tmp/ibdiagnet2/ibdiagnet2.aguid
-I- VPorts file : /var/tmp/ibdiagnet2/ibdiagnet2.vports
-I- VPorts Pkey file : /var/tmp/ibdiagnet2/ibdiagnet2.vports_pkey
-I- Partition keys file : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
Any help getting me to the point where I can run mpirun across all my nodes would be greatly appreciated.
Cheers
(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ source hpcx-init.sh
(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ hpcx_load
(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ env | grep HPCX
HPCX_HCOLL_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/hcoll
HPCX_CLUSTERKIT_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/clusterkit
HPCX_OSU_CUDA_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2-cuda
HPCX_OSU_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/osu-micro-benchmarks-5.6.2
HPCX_MPI_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi
HPCX_OSHMEM_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi
HPCX_HOME=/home/user/infiniband/hpcx
HPCX_UCX_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ucx
HPCX_IPM_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/ipm-2.0.6
HPCX_SHARP_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/sharp
HPCX_IPM_LIB=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/ipm-2.0.6/lib/libipm.so
HPCX_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64
HPCX_NCCL_RDMA_SHARP_PLUGIN_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/nccl_rdma_sharp_plugin
HPCX_MPI_TESTS_DIR=/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests
(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
(base) user@oak-rd0-linux:~/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64$ mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
[1642388680.640057] [oak-rd0-linux:2429918:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)
[1642388680.640358] [oak-rd0-linux:2429919:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)
[1642388680.680581] [oak-rd0-linux:2429918:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)
[1642388680.681228] [oak-rd0-linux:2429919:0] ucp_context.c:1467 UCX WARN UCP version is incompatible, required: 1.10, actual: 1.8 (release 0 /lib/libucp.so.0)
/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/examples/hello_c: symbol lookup error: /home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/lib/openmpi/mca_pml_ucx.so: undefined symbol: ucp_tag_recv_nbx
/home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/tests/examples/hello_c: symbol lookup error: /home/user/infiniband/hpcx-v2.8.1-gcc-MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu20.04-x86_64/ompi/lib/openmpi/mca_pml_ucx.so: undefined symbol: ucp_tag_recv_nbx
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[6258,1],0]
Exit code: 127