How to run HPL script over Ethernet

Hello, I’m struggling with mpirun.
I want to run the HPL benchmark on A100 GPUs. My nodes don’t support InfiniBand.
I have 2 nodes with 8 GPUs per node.

We have tried to solve this problem in the following ways.

Using mpirun:

mpirun -np 2 -hostfile ./hosts  --allow-run-as-root \
    singularity run --nv \
     -B "HPL.dat" "hpc-benchmarks:24.03.sif" \
     /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat

The error output is:

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 0] Peer GPU 1 is not accessible, exiting ... 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 1] Peer GPU 0 is not accessible, exiting ... 
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed 

The hostfile is:

gpu01 slots=1
gpu02 slots=1

HPL.dat is:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
17400   	 Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

I have no idea what the problem is. Please help me.

Hi Rabbins03,

Which version of MPI are you using?
What version of HPL are you using?

The actual error is coming from NVSHMEM which, per the install guide, only supports the following networks:

- InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)
- Slingshot-11 (Libfabric CXI provider)
- Amazon EFA (Libfabric EFA provider)

What I don’t know is whether NVSHMEM is being used directly within the HPL version you’re using, or is coming from your MPI.

If it’s part of HPL, then you’ll likely need to find a different version. I don’t run HPL myself, but can ask others if you need advice on which version to run.

If it’s being used by your MPI, for example if you’re using the HPC-X that we ship as part of the NVHPC SDK, then you might try switching to OpenMPI 3.1.5, which would be found under your NVHPC install’s “comm_libs/openmpi/openmpi-3.1.5” directory.

-Mat

Here are my versions:

MPI version: 4.1.2
hpc-benchmarks: 24.03.sif

My second approach was to install the HPC SDK. I then loaded the “nvhpc-openmpi3” module and tried to run the above command, but it hung.

Just in case, here are my module MPI versions:

  1. nvhpc → mpi version is 4.1.7a1
  2. nvhpc-hpcx → same as 4.1.7a1
  3. nvhpc-openmpi3 → 3.1.5

We succeeded in running with 21.4.sif using 1 GPU on each of the 2 nodes. However, we need to test with 8 GPUs per node on 2 nodes, and that still produces these errors.

Here’s our new approach.

srun -N 2 --ntasks-per-node=4 \
     --mpi=pmi2 \
     singularity run --nv \
     --env UCX_NET_DEVICES=bond0 \
     --env UCX_TLS=tcp,sockcm \
     -B "${MOUNT}" "${CONT}" \
     /workspace/hpl-linux-x86_64/hpl.sh \
     --dat /my-dat-files/HPL_please.dat \
     --cpu-affinity 0-3:32-35:64-67:96-99 --cpu-cores-per-rank 1 --gpu-affinity 0:1:2:3

The error is:

cpu and/or gpu values not set

Thank you in advance for your help.

Hi Rabbins03,

It’s possible to run NVIDIA-HPL without NVSHMEM.
Please set the environment variable HPL_USE_NVSHMEM=0 to disable NVSHMEM.
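For example, with OpenMPI’s mpirun you can forward the variable to every rank with -x. This is only a sketch that reuses the container name and .dat path from your first command:

```shell
# Sketch only: HPL_USE_NVSHMEM=0 disables NVSHMEM so HPL falls back to
# MPI/NCCL. Container name and paths are copied from the earlier example.
mpirun -np 2 -hostfile ./hosts --allow-run-as-root \
    -x HPL_USE_NVSHMEM=0 \
    singularity run --nv \
    -B "HPL.dat" "hpc-benchmarks:24.03.sif" \
    /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat
```

With Singularity you could equivalently pass `--env HPL_USE_NVSHMEM=0` to `singularity run`.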


Hello, may I ask one more question?
I managed to run the test on 2 nodes, but the performance is significantly worse than on a single node.

Here’s my command.

mpirun -np 16 -hostfile ./hosts \
    -x HPL_USE_NVSHMEM=0 \
    -x NCCL_P2P_NET_CHUNKSIZE=67108864 -x NCCL_SOCKET_NTHREADS=4 \
    singularity run --nv \
    -B "${MOUNT}" "${CONT}" \
    /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat

My .dat file:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
397312   	 Ns
1            # of NBs
512          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
16            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
2            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

I suspect network overhead is degrading performance. How can I improve this?
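For reference, Ns was sized to fill most of the aggregate GPU memory. A rough sketch of that rule of thumb (the 16 GPUs x 80 GiB figures and the 90% fill fraction are my assumptions for these A100 nodes, not values from any official tool):

```shell
# Rule-of-thumb Ns: the N x N double-precision matrix (8 bytes/element)
# should fill ~90% of aggregate GPU memory, with N rounded down to a
# multiple of the block size NB. 1 GiB = 1073741824 bytes.
GPUS=16; MEM_GIB=80; NB=512
TOTAL_BYTES=$((GPUS * MEM_GIB * 1073741824))
N=$(awk -v b="$TOTAL_BYTES" -v nb="$NB" \
    'BEGIN { n = int(sqrt(0.90 * b / 8)); print int(n / nb) * nb }')
echo "$N"   # prints 393216
```

which lands in the same ballpark as the Ns = 397312 used above.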