Hello, we are trying to run the 21.4-hpl container across 2 nodes with 2 Tesla V100s per node under Slurm. Unfortunately we are running into some problems that I am hoping someone can assist with. The only way I can get the container to run with sbatch is the script below, but the job eventually fails.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --job-name "test-hpl-ai.1N"
#SBATCH --partition=v100
#SBATCH --time=40:00
#SBATCH --output=slurm-%x.%J.%N.out
#SBATCH --mem=0
#SBATCH --gpus-per-node=tesla:2
#SBATCH --exclusive
DATESTRING=$(date "+%Y-%m-%dT%H:%M:%S")
# image source: docker://nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
CONT="/home/user/working/HPL/NV/nvhpl.sif"
MOUNT="/home/user/working/HPL/NV/DAT:/DAT"
echo "Running on hosts: $(echo $(scontrol show hostname))"
echo "$DATESTRING"
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=2
mpirun -np 16 --allow-run-as-root apptainer run --nv -B "${MOUNT}" "${CONT}" /DAT/hpl.sh --cpu-affinity 0:0:0:0:1:1:1:1 --cpu-cores-per-rank 16 --gpu-affinity 0:0:0:0:1:1:1:1 --dat /DAT/HPL.dat
echo "Done"
echo "$DATESTRING"
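For comparison, this is the pared-down one-rank-per-GPU layout we have been drafting (4 ranks total instead of 16). The affinity values here are guesses for our socket/GPU topology, we have not confirmed that this is the layout hpl.sh expects, and the P x Q grid in HPL.dat would presumably need to match it (see the note after the dat file below):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2        # one MPI rank per GPU (our assumption)
#SBATCH --gpus-per-node=tesla:2
#SBATCH --partition=v100
#SBATCH --time=40:00
#SBATCH --exclusive

CONT="/home/user/working/HPL/NV/nvhpl.sif"
MOUNT="/home/user/working/HPL/NV/DAT:/DAT"

# launch through Slurm's PMI instead of mpirun; affinity values are guesses
srun --mpi=pmi2 apptainer run --nv -B "${MOUNT}" "${CONT}" \
    /DAT/hpl.sh --cpu-affinity 0:1 --cpu-cores-per-rank 8 \
    --gpu-affinity 0:1 --dat /DAT/HPL.dat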
With the original sbatch script, the job submits; however, it eventually fails during the run with:
!!! WARNING: RANK: 0 HOST: narsil-gpu2 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1873
!!! WARNING: RANK: 8 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1831
!!! WARNING: RANK: 6 HOST: narsil-gpu2 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 768 1866
!!! WARNING: RANK: 4 HOST: narsil-gpu2 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 768 1883
!!! WARNING: RANK: 2 HOST: narsil-gpu2 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1783
!!! WARNING: RANK: 10 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1824
!!! WARNING: RANK: 11 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1884
NB = 896 1689 2239 1878
!!! WARNING: RANK: 14 HOST: narsil-gpu3 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 896 1689
!!! WARNING: RANK: 14 HOST: narsil-gpu3 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB=1024 1676
NB = 1024 1676 2136 1864
NET :
PROC COL NET_BW [MB/s ]
[narsil-gpu3:2849992:0:2849992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850055:0:2850055] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850074:0:2850074] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x177ff9e)
[narsil-gpu3:2850082] *** An error occurred in MPI_Sendrecv
[narsil-gpu3:2850082] *** reported by process [3518562305,12]
[narsil-gpu3:2850082] *** on communicator MPI_COMM_WORLD
[narsil-gpu3:2850082] *** MPI_ERR_COMM: invalid communicator
[narsil-gpu3:2850082] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[narsil-gpu3:2850082] *** and potentially your MPI job)
[narsil-gpu3:2849970:0:2849970] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850019:0:2850019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x177ff9e)
[narsil-gpu3:2850004:0:2850004] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2849992) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8db PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2849992] *** Process received signal ***
[narsil-gpu3:2849992] Signal: Segmentation fault (11)
[narsil-gpu3:2849992] Signal code: (-6)
[narsil-gpu3:2849992] Failing at address: 0x2b7cc8
[narsil-gpu3:2849992] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x1462b9cc5980]
[narsil-gpu3:2849992] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x15b)[0x1462b9f608db]
[narsil-gpu3:2849992] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2849992] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x1462b98e3bf7]
[narsil-gpu3:2849992] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
[narsil-gpu3:2849992] *** End of error message ***
==== backtrace (tid:2850055) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8db PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2850055] *** Process received signal ***
[narsil-gpu3:2850055] Signal: Segmentation fault (11)
[narsil-gpu3:2850055] Signal code: (-6)
[narsil-gpu3:2850055] Failing at address: 0x2b7d07
[narsil-gpu3:2850055] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x14965ab87980]
[narsil-gpu3:2850055] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x15b)[0x14965ae228db]
[narsil-gpu3:2850055] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2850055] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x14965a7a5bf7]
[narsil-gpu3:2850055] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
[narsil-gpu3:2850055] *** End of error message ***
==== backtrace (tid:2850074) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8fe PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2850074] *** Process received signal ***
[narsil-gpu3:2850074] Signal: Segmentation fault (11)
[narsil-gpu3:2850074] Signal code: (-6)
[narsil-gpu3:2850074] Failing at address: 0x2b7d1a
[narsil-gpu3:2850074] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x146c3e879980]
[narsil-gpu3:2850074] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x17e)[0x146c3eb148fe]
[narsil-gpu3:2850074] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2850074] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x146c3e497bf7]
[narsil-gpu3:2850074] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
If I try submitting with srun instead, we receive this:
[root@host-gpu-login3 NV]# CONT='/home/user/working/HPL/NV/nvhpl.sif'
[root@host-gpu-login3 NV]# MOUNT="/home/user/working/HPL/NV/DAT:/DAT"
[root@host-gpu-login3 NV]# srun --mpi=pmi2 -p v100 -N2 -n 16 --cpu-bind=none apptainer run --nv -B "${MOUNT}" "${CONT}" /DAT/hpl.sh --cpu-affinity 0:1:2:3:4:5:6:7 --cpu-cores-per-rank 8 --gpu-affinity 0:0:0:0:1:1:1:1 --dat /DAT/HPL.dat
srun: job 48117 queued and waiting for resources
srun: job 48117 has been allocated resources
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
INFO: host=host-gpu2 rank=2 lrank=2 cores=8 gpu=0 cpu=2 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=5 lrank=5 cores=8 gpu=1 cpu=5 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=1 lrank=1 cores=8 gpu=0 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=7 lrank=7 cores=8 gpu=1 cpu=7 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=3 lrank=3 cores=8 gpu=0 cpu=3 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=6 lrank=6 cores=8 gpu=1 cpu=6 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=4 lrank=4 cores=8 gpu=1 cpu=4 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=0 lrank=0 cores=8 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
INFO: host=host-gpu3 rank=12 lrank=4 cores=8 gpu=1 cpu=4 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=15 lrank=7 cores=8 gpu=1 cpu=7 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=8 lrank=0 cores=8 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=13 lrank=5 cores=8 gpu=1 cpu=5 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=9 lrank=1 cores=8 gpu=0 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=10 lrank=2 cores=8 gpu=0 cpu=2 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=11 lrank=3 cores=8 gpu=0 cpu=3 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=14 lrank=6 cores=8 gpu=1 cpu=6 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
================================================================================
HPL-NVIDIA 1.0.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
HPLinpack 2.1  --  High-Performance Linpack benchmark  --  October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20960
NB : 288
PMAP : Row-major process mapping
P : 4
Q : 2
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 2 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
[host-gpu3:2777409:0:2777409] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[host-gpu3:2777411:0:2777411] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[host-gpu3:2777412:0:2777412] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
!!! WARNING: RANK: 12 HOST: host-gpu3 GPU: 0000:a3:00.0 GPU_BW [GB/s ] 275
[host-gpu3:2777407:0:2777407] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[host-gpu3:2777410:0:2777410] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
!!! WARNING: RANK: 14 HOST: host-gpu3 GPU: 0000:a3:00.0 GPU_BW [GB/s ] 184
[host-gpu3:2777414:0:2777414] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[host-gpu3:2777413] *** An error occurred in MPI_Sendrecv
[host-gpu3:2777413] *** reported by process [3153395712,11]
[host-gpu3:2777413] *** on communicator MPI_COMM_WORLD
[host-gpu3:2777413] *** MPI_ERR_COMM: invalid communicator
[host-gpu3:2777413] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-gpu3:2777413] *** and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 48117.0 ON host-gpu2 CANCELLED AT 2022-11-04T09:10:00 ***
srun: error: host-gpu2: tasks 0-7: Killed
srun: error: host-gpu3: tasks 8-15: Killed
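Regarding the repeated nv_peer_mem notes in the output above, we were planning to check on the compute nodes whether a GPUDirect peer-memory module is actually loaded; we assume this only explains the reduced-bandwidth warning and not the segfaults themselves:

# check for either the legacy or the in-driver peer-memory module
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'
# if absent and supported by the installed driver, try loading it (as root)
modprobe nvidia-peermem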
The HPL.dat file we are using (which is probably wrong) is:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
960 Ns
1 # of NBs
288 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
2 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
2 memory alignment in double (> 0)
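One inconsistency we noticed ourselves: the grid above is P x Q = 4 x 2 = 8 processes, while both launch lines start 16 ranks (2 nodes x 8 tasks per node); also, the file lists Ns = 960 while the srun output reports N : 20960, so this copy of the dat may be stale. If we switch to one rank per GPU (4 ranks total), we assume the grid would have to shrink to match, something like:

1            # of process grids (P x Q)
2            Ps
2            Qs

but we are not sure whether the NVIDIA xhpl binary tolerates launching more ranks than P x Q, so any guidance on the expected rank/grid layout would be appreciated.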