Hello, I’m struggling with mpirun.
I want to run the HPL benchmark on A100 GPUs. My nodes don’t support InfiniBand.
I have 2 nodes with 8 GPUs per node.
We are trying to solve this in the following way, using mpirun:
mpirun -np 2 -hostfile ./hosts --allow-run-as-root \
singularity run --nv \
-B "HPL.dat" "hpc-benchmarks:24.03.sif" \
/workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat
The error output is:
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 0] Peer GPU 1 is not accessible, exiting ...
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/topo/topo.cpp:476: [GPU 1] Peer GPU 0 is not accessible, exiting ...
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:972: non-zero status: 3 building transport map failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
WARN: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Invalid argument, exiting... mutex destroy failed
My hostfile is:
gpu01 slots=1
gpu02 slots=1
My HPL.dat is:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
17400 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
2 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
I have no idea what the problem is. Please help me.
Hi Rabbins03,
Which version of MPI are you using?
What version of HPL are you using?
The actual error is coming from NVSHMEM which, per the install guide, only supports the following networks:
- InfiniBand/RoCE with a Mellanox adapter (CX-4 or later)
- Slingshot-11 (Libfabric CXI provider)
- Amazon EFA (Libfabric EFA provider)
What I don’t know is whether NVSHMEM is being used directly within the HPL version you’re using, or is coming from your MPI.
If it’s part of HPL, then you’ll likely need to find a different version. I don’t run HPL myself, but can ask others if you need advice on which version to run.
If it’s being used by your MPI, for example if you’re using the HPC-X that we ship as part of the NVHPC SDK, then you might try switching to OpenMPI 3.1.5, which would be found under your NVHPC install’s “comm_libs/openmpi/openmpi-3.1.5” directory.
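As a rough sketch of what that switch could look like (the NVHPC install prefix and version below are assumptions; adjust them to your actual install):

# Put the OpenMPI 3.1.5 shipped with the NVHPC SDK first on the PATH
# (example prefix only; substitute your real NVHPC install path and version)
export NVHPC_ROOT=/opt/nvidia/hpc_sdk/Linux_x86_64/24.3
export PATH=$NVHPC_ROOT/comm_libs/openmpi/openmpi-3.1.5/bin:$PATH
export LD_LIBRARY_PATH=$NVHPC_ROOT/comm_libs/openmpi/openmpi-3.1.5/lib:$LD_LIBRARY_PATH
# Verify the right launcher is picked up before re-running the benchmark
which mpirun
mpirun --version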
-Mat
Here are my versions.
MPI version is 4.1.2
hpc-benchmarks: 24.03.sif
My second approach was to install the NVHPC SDK. I then loaded the “nvhpc-openmpi3” module and tried to run the above command, but it hung.
Just in case, here are my module MPI versions (see the quick check sketched after this list):
nvhpc → mpi version is 4.1.7a1
nvhpc-hpcx → same as 4.1.7a1
nvhpc-openmpi3 → 3.1.5
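(As a sanity check when a launch hangs after switching modules, one way to confirm which mpirun actually gets used is sketched below; the module names are the ones listed above and the hostfile is the one from the first post.)

module load nvhpc-openmpi3               # or nvhpc / nvhpc-hpcx
which mpirun                             # should resolve into the NVHPC comm_libs tree
mpirun --version                         # should report Open MPI 3.1.5 for nvhpc-openmpi3
mpirun -np 2 -hostfile ./hosts hostname  # trivial 2-node test before launching the container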
We succeeded in running with 21.4.sif using 1 GPU on each of the 2 nodes. However, we need to test with 8 GPUs per node on 2 nodes, and that setup still shows these errors.
Here’s our new approach:
srun -N 2 --ntasks-per-node=4 \
--mpi=pmi2 \
singularity run --nv \
--env UCX_NET_DEVICES=bond0 \
--env UCX_TLS=tcp,sockcm \
-B "${MOUNT}" "${CONT}" \
/workspace/hpl-linux-x86_64/hpl.sh \
--dat /my-dat-files/HPL_please.dat \
--cpu-affinity 0-3:32-35:64-67:96-99 --cpu-cores-per-rank 1 --gpu-affinity 0:1:2:3
The error is:
cpu and/or gpu values not set
Thank you in advance for your help.
Hi Rabbins03,
It’s possible to run NVIDIA-HPL without NVSHMEM.
Please set the environment variable HPL_USE_NVSHMEM=0 to disable NVSHMEM.
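For example, one way to pass it through an OpenMPI launch into the Singularity container (a minimal sketch reusing the hostfile, mount, and container variables from the commands above; adjust them to your setup):

mpirun -np 2 -hostfile ./hosts --allow-run-as-root \
    -x HPL_USE_NVSHMEM=0 \
    singularity run --nv \
    -B "${MOUNT}" "${CONT}" \
    /workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat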
Hello, may I ask one more question?
I managed to run my test on 2 nodes, but the performance is significantly worse than on a single node.
Here’s my command.
mpirun -np 16 -hostfile ./hosts \
-x HPL_USE_NVSHMEM=0 \
-x NCCL_P2P_NET_CHUNKSIZE=67108864 -x NCCL_SOCKET_NTHREADS=4 \
singularity run --nv \
-B "${MOUNT}" "${CONT}" \
/workspace/hpl.sh --dat /my-dat-files/HPL_multinode.dat
My .dat file is:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
397312 Ns
1 # of NBs
512 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
16 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
I think network overhead may be degrading performance. How can I improve this?