Hello, we are trying to run the 21.4-hpl container across 2 nodes with 2 Tesla V100s per node under Slurm. Unfortunately we are running into some problems that I am hoping someone can assist with. The only way I can get the container to run with sbatch is the script below, but the job eventually fails.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --job-name "test-hpl-ai.1N"
#SBATCH --partition=v100
#SBATCH --time=40:00
#SBATCH --output=slurm-%x.%J.%N.out
#SBATCH --mem=0
#SBATCH --gpus-per-node=tesla:2
#SBATCH --exclusive
DATESTRING=$(date "+%Y-%m-%dT%H:%M:%S")
# image source: docker://nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
CONT="/home/user/working/HPL/NV/nvhpl.sif"
MOUNT="/home/user/working/HPL/NV/DAT:/DAT"
echo "Running on hosts: $(echo $(scontrol show hostname))"
echo "$DATESTRING"
export OMP_NUM_THREADS=1
export CUDA_VISIBLE_DEVICES=2
mpirun -np 16 --allow-run-as-root apptainer run --nv -B "${MOUNT}" "${CONT}" /DAT/hpl.sh --cpu-affinity 0:0:0:0:1:1:1:1 --cpu-cores-per-rank 16 --gpu-affinity 0:0:0:0:1:1:1:1 --dat /DAT/HPL.dat
echo "Done"
echo "$DATESTRING"
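For comparison, this is the pared-down one-rank-per-GPU layout we have been drafting (4 ranks total instead of 16). The affinity values here are guesses for our socket/GPU topology, we have not confirmed that this is the layout hpl.sh expects, and the P x Q grid in HPL.dat would presumably need to match it (see the note after the dat file below):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2        # one MPI rank per GPU (our assumption)
#SBATCH --gpus-per-node=tesla:2
#SBATCH --partition=v100
#SBATCH --time=40:00
#SBATCH --exclusive

CONT="/home/user/working/HPL/NV/nvhpl.sif"
MOUNT="/home/user/working/HPL/NV/DAT:/DAT"

# launch through Slurm's PMI instead of mpirun; affinity values are guesses
srun --mpi=pmi2 apptainer run --nv -B "${MOUNT}" "${CONT}" \
    /DAT/hpl.sh --cpu-affinity 0:1 --cpu-cores-per-rank 8 \
    --gpu-affinity 0:1 --dat /DAT/HPL.dat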
With the original sbatch script, the job submits; however, it eventually fails during the run with:
!!! WARNING: RANK: 0 HOST: narsil-gpu2 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1873
!!! WARNING: RANK: 8 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1831
!!! WARNING: RANK: 6 HOST: narsil-gpu2 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 768 1866
!!! WARNING: RANK: 4 HOST: narsil-gpu2 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 768 1883
!!! WARNING: RANK: 2 HOST: narsil-gpu2 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1783
!!! WARNING: RANK: 10 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1824
!!! WARNING: RANK: 11 HOST: narsil-gpu3 GPU: 0000:27:00.0 GPU_FP [GFLPS] @NB= 768 1884
NB = 896 1689 2239 1878
!!! WARNING: RANK: 14 HOST: narsil-gpu3 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB= 896 1689
!!! WARNING: RANK: 14 HOST: narsil-gpu3 GPU: 0000:a3:00.0 GPU_FP [GFLPS] @NB=1024 1676
NB = 1024 1676 2136 1864
NET :
PROC COL NET_BW [MB/s ]
[narsil-gpu3:2849992:0:2849992] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850055:0:2850055] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850074:0:2850074] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x177ff9e)
[narsil-gpu3:2850082] *** An error occurred in MPI_Sendrecv
[narsil-gpu3:2850082] *** reported by process [3518562305,12]
[narsil-gpu3:2850082] *** on communicator MPI_COMM_WORLD
[narsil-gpu3:2850082] *** MPI_ERR_COMM: invalid communicator
[narsil-gpu3:2850082] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[narsil-gpu3:2850082] *** and potentially your MPI job)
[narsil-gpu3:2849970:0:2849970] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[narsil-gpu3:2850019:0:2850019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x177ff9e)
[narsil-gpu3:2850004:0:2850004] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2849992) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8db PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2849992] *** Process received signal ***
[narsil-gpu3:2849992] Signal: Segmentation fault (11)
[narsil-gpu3:2849992] Signal code: (-6)
[narsil-gpu3:2849992] Failing at address: 0x2b7cc8
[narsil-gpu3:2849992] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x1462b9cc5980]
[narsil-gpu3:2849992] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x15b)[0x1462b9f608db]
[narsil-gpu3:2849992] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2849992] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x1462b98e3bf7]
[narsil-gpu3:2849992] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
[narsil-gpu3:2849992] *** End of error message ***
==== backtrace (tid:2850055) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8db PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2850055] *** Process received signal ***
[narsil-gpu3:2850055] Signal: Segmentation fault (11)
[narsil-gpu3:2850055] Signal code: (-6)
[narsil-gpu3:2850055] Failing at address: 0x2b7d07
[narsil-gpu3:2850055] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x14965ab87980]
[narsil-gpu3:2850055] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x15b)[0x14965ae228db]
[narsil-gpu3:2850055] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2850055] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x14965a7a5bf7]
[narsil-gpu3:2850055] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
[narsil-gpu3:2850055] *** End of error message ***
==== backtrace (tid:2850074) ====
0 0x0000000000024e35 ucs_debug_print_backtrace() /var/tmp/ucx-1.10.0/src/ucs/debug/debug.c:656
1 0x0000000000012980 __funlockfile() ???:0
2 0x000000000008e8fe PMPI_Sendrecv() ???:0
3 0x000000000040d5bb ???() /workspace/hpl-linux-x86_64/xhpl:0
4 0x0000000000021bf7 __libc_start_main() ???:0
5 0x000000000040f4a9 ???() /workspace/hpl-linux-x86_64/xhpl:0
[narsil-gpu3:2850074] *** Process received signal ***
[narsil-gpu3:2850074] Signal: Segmentation fault (11)
[narsil-gpu3:2850074] Signal code: (-6)
[narsil-gpu3:2850074] Failing at address: 0x2b7d1a
[narsil-gpu3:2850074] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x146c3e879980]
[narsil-gpu3:2850074] [ 1] /usr/local/openmpi/lib/libmpi.so.40(MPI_Sendrecv+0x17e)[0x146c3eb148fe]
[narsil-gpu3:2850074] [ 2] /workspace/hpl-linux-x86_64/xhpl[0x40d5bb]
[narsil-gpu3:2850074] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x146c3e497bf7]
[narsil-gpu3:2850074] [ 4] /workspace/hpl-linux-x86_64/xhpl[0x40f4a9]
If I try submitting with srun instead, we receive this:
[root@host-gpu-login3 NV]# CONT='/home/user/working/HPL/NV/nvhpl.sif'
[root@host-gpu-login3 NV]# MOUNT="/home/user/working/HPL/NV/DAT:/DAT"
[root@host-gpu-login3 NV]# srun --mpi=pmi2 -p v100 -N2 -n 16 --cpu-bind=none apptainer run --nv -B "${MOUNT}" "${CONT}" /DAT/hpl.sh --cpu-affinity 0:1:2:3:4:5:6:7 --cpu-cores-per-rank 8 --gpu-affinity 0:0:0:0:1:1:1:1 --dat /DAT/HPL.dat
srun: job 48117 queued and waiting for resources
srun: job 48117 has been allocated resources
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /etc/localtime required more than 50 (87) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (263) bind mounts
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
INFO: host=host-gpu2 rank=2 lrank=2 cores=8 gpu=0 cpu=2 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=5 lrank=5 cores=8 gpu=1 cpu=5 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=1 lrank=1 cores=8 gpu=0 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=7 lrank=7 cores=8 gpu=1 cpu=7 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=3 lrank=3 cores=8 gpu=0 cpu=3 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=6 lrank=6 cores=8 gpu=1 cpu=6 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=4 lrank=4 cores=8 gpu=1 cpu=4 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu2 rank=0 lrank=0 cores=8 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
Detected MOFED 5.6-2.0.9.
Detected MOFED 5.6-2.0.9.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
INFO: host=host-gpu3 rank=12 lrank=4 cores=8 gpu=1 cpu=4 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=15 lrank=7 cores=8 gpu=1 cpu=7 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=8 lrank=0 cores=8 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=13 lrank=5 cores=8 gpu=1 cpu=5 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=9 lrank=1 cores=8 gpu=0 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=10 lrank=2 cores=8 gpu=0 cpu=2 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=11 lrank=3 cores=8 gpu=0 cpu=3 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=host-gpu3 rank=14 lrank=6 cores=8 gpu=1 cpu=6 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
================================================================================
HPL-NVIDIA 1.0.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
HPLinpack 2.1  --  High-Performance Linpack benchmark  --  October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20960
NB : 288
PMAP : Row-major process mapping
P : 4
Q : 2
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 2 double precision words
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
[host-gpu3:2777409:0:2777409] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[host-gpu3:2777411:0:2777411] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[host-gpu3:2777412:0:2777412] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
!!! WARNING: RANK: 12 HOST: host-gpu3 GPU: 0000:a3:00.0 GPU_BW [GB/s ] 275
[host-gpu3:2777407:0:2777407] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[host-gpu3:2777410:0:2777410] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
!!! WARNING: RANK: 14 HOST: host-gpu3 GPU: 0000:a3:00.0 GPU_BW [GB/s ] 184
[host-gpu3:2777414:0:2777414] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfc)
[host-gpu3:2777413] *** An error occurred in MPI_Sendrecv
[host-gpu3:2777413] *** reported by process [3153395712,11]
[host-gpu3:2777413] *** on communicator MPI_COMM_WORLD
[host-gpu3:2777413] *** MPI_ERR_COMM: invalid communicator
[host-gpu3:2777413] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-gpu3:2777413] *** and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 48117.0 ON host-gpu2 CANCELLED AT 2022-11-04T09:10:00 ***
srun: error: host-gpu2: tasks 0-7: Killed
srun: error: host-gpu3: tasks 8-15: Killed
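Regarding the repeated nv_peer_mem notes in the output above, we were planning to check on the compute nodes whether a GPUDirect peer-memory module is actually loaded; we assume this only explains the reduced-bandwidth warning and not the segfaults themselves:

# check for either the legacy or the in-driver peer-memory module
lsmod | grep -E 'nv_peer_mem|nvidia_peermem'
# if absent and supported by the installed driver, try loading it (as root)
modprobe nvidia-peermem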
The HPL.dat file we are using (which is probably wrong) is:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
960 Ns
1 # of NBs
288 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
2 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
2 memory alignment in double (> 0)
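One inconsistency we noticed ourselves: the grid above is P x Q = 4 x 2 = 8 processes, while both launch lines start 16 ranks (2 nodes x 8 tasks per node); also, the file lists Ns = 960 while the srun output reports N : 20960, so this copy of the dat may be stale. If we switch to one rank per GPU (4 ranks total), we assume the grid would have to shrink to match, something like:

1            # of process grids (P x Q)
2            Ps
2            Qs

but we are not sure whether the NVIDIA xhpl binary tolerates launching more ranks than P x Q, so any guidance on the expected rank/grid layout would be appreciated.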