How to run HPL script over Ethernet

Hello, I’m struggling with mpirun.
I want to run the HPL benchmark on A100 GPUs. My nodes don’t support InfiniBand.
I have 2 nodes with 8 GPUs per node.

We have tried to solve this problem in the following ways.

Using mpirun:

mpirun -np 2 -hostfile ./hosts  --allow-run-as-root \
    singularity run --nv \
     -B "HPL.dat" "hpc-benchmarks:24.03.sif" \
     /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat

The error output is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 0] Peer GPU 1 is not accessible, exiting ... 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 1] Peer GPU 0 is not accessible, exiting ... 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed 

The hostfile is:

gpu01 slots=1
gpu02 slots=1

HPL.dat is:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
17400   	 Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

I have no idea what the problem is. Please help me.

Hi Rabbins03,

Which version of MPI are you using?
What version of HPL are you using?

The actual error is coming from NVSHMEM which, per the install guide, only supports the following networks:

- InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)
- Slingshot-11 (Libfabric CXI provider)
- Amazon EFA (Libfabric EFA provider)

What I don’t know is whether NVSHMEM is being used directly within the HPL version you’re using, or is coming from your MPI.

If it’s part of HPL, then you’ll likely need to find a different version. I don’t run HPL myself, but can ask others if you need advice on which version to run.

If it’s being used by your MPI, for example if you’re using the HPC-X that we ship as part of the NVHPC SDK, then you might try switching to OpenMPI 3.1.5, which would be found under your NVHPC install’s “comm_libs/openmpi/openmpi-3.1.5” directory.

-Mat

Here are my versions:

MPI version: 4.1.2
hpc-benchmarks: 24.03.sif

My second approach was to install the HPC SDK. I then loaded the “nvhpc-openmpi3” module and tried to run the above command, but it hung.

Just in case, here are my module MPI versions:

  1. nvhpc → mpi version is 4.1.7a1
  2. nvhpc-hpcx → same as 4.1.7a1
  3. nvhpc-openmpi3 → 3.1.5

We succeeded in running with 21.4.sif using 1 GPU on each of the 2 nodes. However, we need to test with 8 GPUs per node on 2 nodes, and that still produces these errors.

Here’s our new approach.

srun -N 2 --ntasks-per-node=4 \
     --mpi=pmi2 \
     singularity run --nv \
     --env UCX_NET_DEVICES=bond0 \
     --env UCX_TLS=tcp,sockcm \
     -B "${MOUNT}" "${CONT}" \
     /workspace/hpl-linux-x86_64/hpl.sh \
     --dat /my-dat-files/HPL_please.dat \
     --cpu-affinity 0-3:32-35:64-67:96-99 --cpu-cores-per-rank 1 --gpu-affinity 0:1:2:3

The error is:

cpu and/or gpu values not set

Thank you in advance for your help.

Hi Rabbins03,

It’s possible to run NVIDIA-HPL without NVSHMEM.
Please set the environment variable HPL_USE_NVSHMEM=0 to disable NVSHMEM.
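For example, with OpenMPI’s mpirun you can forward the variable to every rank with -x. This is only a sketch that reuses the container name and .dat path from your first command:

```shell
# Sketch only: HPL_USE_NVSHMEM=0 disables NVSHMEM so HPL falls back to
# MPI/NCCL. Container name and paths are copied from the earlier example.
mpirun -np 2 -hostfile ./hosts --allow-run-as-root \
    -x HPL_USE_NVSHMEM=0 \
    singularity run --nv \
    -B "HPL.dat" "hpc-benchmarks:24.03.sif" \
    /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat
```

With Singularity you could equivalently pass `--env HPL_USE_NVSHMEM=0` to `singularity run`.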


Hello, may I ask one more question?
I managed to run the test on 2 nodes, but the performance is significantly worse than on a single node.

Here’s my command.

mpirun -np 16 -hostfile ./hosts \
    -x HPL_USE_NVSHMEM=0 \
    -x NCCL_P2P_NET_CHUNKSIZE=67108864 -x NCCL_SOCKET_NTHREADS=4 \
    singularity run --nv \
    -B "${MOUNT}" "${CONT}" \
    /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat

My .dat file:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
397312   	 Ns
1            # of NBs
512          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
16            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

I suspect network overhead is degrading performance. How can I improve this?
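For reference, Ns was sized to fill most of the aggregate GPU memory. A rough sketch of that rule of thumb (the 16 GPUs x 80 GiB figures and the 90% fill fraction are my assumptions for these A100 nodes, not values from any official tool):

```shell
# Rule-of-thumb Ns: the N x N double-precision matrix (8 bytes/element)
# should fill ~90% of aggregate GPU memory, with N rounded down to a
# multiple of the block size NB. 1 GiB = 1073741824 bytes.
GPUS=16; MEM_GIB=80; NB=512
TOTAL_BYTES=$((GPUS * MEM_GIB * 1073741824))
N=$(awk -v b="$TOTAL_BYTES" -v nb="$NB" \
    'BEGIN { n = int(sqrt(0.90 * b / 8)); print int(n / nb) * nb }')
echo "$N"   # prints 393216
```

which lands in the same ballpark as the Ns = 397312 used above.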