I am attempting to run the HPL docker image on a single node of a cluster.
After tweaking the hpl.sh script as described in this forum post, I was able to execute it interactively with:
# docker run --gpus all -ti nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
Followed by:
mpirun -n 8 hpl.sh --cpu-affinity 3:3:1:1:7:7:5:5 --gpu-affinity 0:1:2:3:4:5:6:7 --cpu-cores-per-rank 8 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat
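As a sanity check on the command above, the small shell sketch below counts the colon-separated entries in each affinity list and compares the count with the rank count passed to mpirun -n. It assumes hpl.sh expects exactly one entry per rank, which matches the command above but is not confirmed here. The helper names count_entries and check_affinity are hypothetical and not part of hpl.sh.

```shell
#!/bin/sh
# Hypothetical helper: count colon-separated entries in a list such as 3:3:1:1.
count_entries() {
    echo "$1" | awk -F: '{ print NF }'
}

# Hypothetical helper: compare an affinity list against the rank count.
# Assumption: hpl.sh expects exactly one entry per MPI rank.
check_affinity() {
    ranks=$1
    list=$2
    n=$(count_entries "$list")
    if [ "$n" -eq "$ranks" ]; then
        echo "ok"
    else
        echo "mismatch: $n entries for $ranks ranks"
    fi
}

check_affinity 8 3:3:1:1:7:7:5:5     # CPU affinity list from the command above
check_affinity 8 0:1:2:3:4:5:6:7     # GPU affinity list from the command above
```

Both lists in my command contain 8 entries, so a mismatch between list length and rank count does not look like the cause here.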
It starts executing fine: it completes all the pre-benchmark checks and then begins the benchmark itself.
All goes well until the very end, when an MPI_Recv error occurs and the program crashes. The kernel then reports a ‘soft lockup’, which requires me to reboot the node.
Below is the last progress report, followed by the MPI error message:
Prog= 99.89% N_left= 20960 Time= 354.86 Time_left= 0.40 iGF= 3341.79 GF= 15229.63 iGF_per= 417.72 GF_per= 1903.70
[scc-gpu01:06096] *** An error occurred in MPI_Recv
[scc-gpu01:06096] *** reported by process [1692532737,2]
[scc-gpu01:06096] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[scc-gpu01:06096] *** MPI_ERR_TRUNCATE: message truncated
[scc-gpu01:06096] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[scc-gpu01:06096] *** and potentially your MPI job)
[scc-gpu01:06042] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[scc-gpu01:06042] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Below is the kernel message that follows. It appears every 30 seconds on the terminals of everyone logged in:
Message from syslogd@scc-gpu01-prod at Mar 15 16:59:24 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [xhpl:6096]
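To make it easier to correlate lockups with MPI ranks, here is a minimal shell sketch that pulls the CPU number, process name and PID out of soft-lockup lines. It assumes the lines follow the format of the message above. The function name parse_lockups is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "cpu process pid" for each soft-lockup line on stdin.
# Assumption: lines look like the kernel message quoted above.
parse_lockups() {
    sed -n 's/.*soft lockup - CPU#\([0-9]*\) stuck for [0-9]*s! \[\(.*\):\([0-9]*\)].*/\1 \2 \3/p'
}

# Example using the exact message from this post.
echo 'kernel:NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [xhpl:6096]' | parse_lockups
```

On a live node the input could come from the kernel log, for example `dmesg | parse_lockups`. In this case the PID in the lockup message (6096) appears to match the process that reported the MPI_Recv error ([scc-gpu01:06096]).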
This is on a single node with 8 NVIDIA A100X GPUs and 2 AMD EPYC 7713 CPUs. The node runs CentOS 7.9, and I am using the latest version of the docker image.
[alexw@scc-gpu02 ~]$ lspci | grep "controller: NVIDIA"
07:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
48:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
4c:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
88:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
8b:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
c8:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
cb:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
I have also tried extracting the xhpl binary, the hpl.sh script and one of the dat files to run outside of the docker image, and the same error still persists.
I have also tried it with the HPL-20.10 version and the same thing happens.
HPCG-21.4 runs fine and does not cause this issue. The HPCG command uses the same GPU and CPU affinity and cores per rank.
I’m at a loss with this; any help is appreciated.
Regards