HPC Container HPL-21.4 MPI_Recv error

I am attempting to run the HPL docker image on 1 node of a cluster.

After tweaking the hpl.sh script as described in this forum post, I was able to execute it interactively with

# docker run --gpus all -ti  nvcr.io/nvidia/hpc-benchmarks:21.4-hpl

Followed by:

mpirun -n 8 hpl.sh --cpu-affinity 3:3:1:1:7:7:5:5 --gpu-affinity 0:1:2:3:4:5:6:7 --cpu-cores-per-rank 8 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

It starts executing fine: it does all the pre-benchmark checks and then starts on the benchmark itself. All goes well until the very end, when an MPI_Recv error occurs, the program crashes, and the kernel then reports a ‘soft lockup’ that requires me to reboot the node. Below is the last progress report, followed by the MPI error message:

 Prog= 99.89%   N_left= 20960   Time= 354.86    Time_left= 0.40 iGF=  3341.79   GF= 15229.63    iGF_per= 417.72         GF_per= 1903.70
[scc-gpu01:06096] *** An error occurred in MPI_Recv
[scc-gpu01:06096] *** reported by process [1692532737,2]
[scc-gpu01:06096] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 4
[scc-gpu01:06096] *** MPI_ERR_TRUNCATE: message truncated
[scc-gpu01:06096] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[scc-gpu01:06096] ***    and potentially your MPI job)
[scc-gpu01:06042] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[scc-gpu01:06042] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Below is the kernel message that follows; it appears every 30 seconds on the terminals of everyone logged in:

Message from syslogd@scc-gpu01-prod at Mar 15 16:59:24 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [xhpl:6096]

This is on a single node with 8 NVIDIA A100X GPUs and 2 AMD EPYC 7713 CPUs, using the latest version of the Docker image, running CentOS 7.9.

[alexw@scc-gpu02 ~]$ lspci | grep "controller: NVIDIA"
07:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
0b:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
48:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
4c:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
88:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
8b:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
c8:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
cb:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)

I have also tried extracting the xhpl binary, the hpl.sh script, and one of the .dat files to run outside of the Docker image, and the same error persists.
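For anyone wanting to reproduce that step: I pulled the files out of the image without running it, roughly like this (the in-image path is the same one used for the sample .dat files above):

# create a stopped container and copy the HPL tree out of it
id=$(docker create nvcr.io/nvidia/hpc-benchmarks:21.4-hpl)
docker cp "$id":/workspace/hpl-linux-x86_64 .
docker rm "$id"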

I have also tried the HPL-20.10 version and the same thing happens.
HPCG-21.4 runs fine and does not cause this issue; the HPCG command uses the same GPU and CPU affinity and cores-per-rank settings.

I’m at a loss with this; any help is appreciated.
Regards

Hi,

Assuming that your system has a topology similar to the canonical HGX-A100, I think the options are correct.
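If you want to double-check that assumption, the topology can be dumped on the host with standard tools (nothing container-specific):

nvidia-smi topo -m     # GPU/NIC interconnect matrix plus CPU affinity per GPU
numactl --hardware     # NUMA layout, to sanity-check the --cpu-affinity mapping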

I suggest using HPL-dgx-a100-1N.dat only as a reference and adapting it to your own system.
For instance, does the problem persist if you significantly reduce the matrix size and use 1rg as the BCAST?

...
100000 Ns 
...
1    # of BCASTs 
0    BCASTs (0=1rg) 
...

I also suggest performing an incremental test starting with one GPU (N = 60000), then adding 10000 per additional GPU. You can adjust N incrementally to find the threshold at which the soft lockup happens.
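As a rough sizing sanity check (assuming 40 GB A100s; HPL stores the N x N matrix in double precision):

# memory footprint of the N = 60000 matrix alone
echo $(( 60000 * 60000 * 8 / 1024**3 )) GiB    # ~26 GiB, fits comfortably on one 40 GB GPU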

If the problem persists, perhaps you can try disabling transparent huge pages (THP).
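On CentOS 7 that can usually be done at runtime as root (the sysfs paths below are the common ones, but may vary by distribution):

cat /sys/kernel/mm/transparent_hugepage/enabled     # check the current setting
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag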

This is just a hunch, btw. I hope it helps.

Thanks for the reply. As suggested, I set N to 60000, P and Q for the grid both to 1, and BCASTs to 0 (the full .dat file is below).

It got to 100% this time without an MPI error, started to print the results and then soft locked again.

 Prog= 100.00%  N_left= 96      Time= 22.35     Time_left= 0.00 iGF=   599.42   GF=  6443.03    iGF_per= 599.42    GF_per= 6443.03
2022-03-18 11:18:47.753
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       60000   288     1     1              22.79              6.320e+03 
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 337668043.9303811 ...... FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . =         204.218002
||A||_oo . . . . . . . . . . . . . . . . . . . =       15149.169010
||A||_1  . . . . . . . . . . . . . . . . . . . =       15157.944015
||x||_oo . . . . . . . . . . . . . . . . . . . =           5.993098
||x||_1  . . . . . . . . . . . . . . . . . . . =       40776.759350
||b||_oo . . . . . . . . . . . . . . . . . . . =           0.499997

Message from syslogd@scc-gpu01-prod at Mar 18 11:19:20 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#176 stuck for 22s! [xhpl:6055]

This is on an HPE Apollo server, if that helps at all.

The .dat file is as follows:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
60000        Ns
1            # of NBs
288          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

Run with the command:

mpirun -n 1 ./hpl.sh --cpu-affinity 0 --gpu-affinity 0 --cpu-cores-per-rank 1 --dat HPL.dat 
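While it runs, the CPU/GPU pinning can be sanity-checked from a second shell with something like:

nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5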

Disabling THP made no difference; the same issue still occurs while printing the results.

Sorry for the belated follow-up.

While the calculation reached 100%, the post-run verification still shows ‘FAILED’ status.
IMHO, this should not be the case for N=60000. For reference, below is my output.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L2       60000   288     1     1              24.20              5.952e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0037689 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
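For context on what PASSED/FAILED means here: HPL compares the scaled residual on the first line against the threshold from the .dat file (16.0 in yours), i.e.

||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N)  <  16.0   ->  PASSED
your run:  337668043.93   >>  16.0  ->  FAILED (the computed answer is numerically wrong)
my run:          0.0038   <   16.0  ->  PASSED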

I am running CentOS Linux release 7.9.2009 with the 3.10.0-1160.el7.x86_64 kernel.
Are you running a 4.x kernel, by any chance?

You did mention that there was no problem with the HPCG benchmark.
With a 256 x 256 x 256 input grid, did the performance come out as good as you expected?
I believe it should be in the vicinity of ~250 GFlops per A100 GPU.

I am also running CentOS Linux release 7.9.2009 (Core) with 3.10.0-1160.59.1.el7.x86_64 kernel.

Running HPCG works fine. If I run it with 1 process, 1 thread, and 1 GPU like so:

mpirun -n 1 ./hpcg.sh --cpu-affinity 0 --gpu-affinity 0 --cpu-cores-per-rank 1 --dat HPCG.dat 

Where HPCG.dat is:

HPCG benchmark input file
Sandia National Laboratories; University of Tennessee, Knoxville
256 256 256
180

Then it runs fine and the final output is as follows:

Completed Benchmarking Phase... elapsed time:  180.1 seconds 
2022-03-23 15:46:32.292

Number of CG sets:      133 
Iterations per set:     52 
scaled res mean:        5.595820e-04 
scaled res variance:    0.000000e+00 

Total Time: 1.801370e+02 sec 
Setup        Overhead: 2.33%
Optimization Overhead: 0.78%
Convergence  Overhead: 3.85%

1x1x1 process grid
256x256x256 local domain
SpMV  =  197.3 GF (1242.7 GB/s Effective)  197.3 GF_per (1242.7 GB/s Effective)
SymGS =  256.0 GF (1976.1 GB/s Effective)  256.0 GF_per (1976.1 GB/s Effective)
total =  239.7 GF (1817.8 GB/s Effective)  239.7 GF_per (1817.8 GB/s Effective)
final =  223.4 GF (1694.0 GB/s Effective)  223.4 GF_per (1694.0 GB/s Effective)

end of application...
2022-03-23 15:46:32.345

This also runs fine with 8 MPI processes, 8 GPUs, and 8 cores per rank:

mpirun -n 8 hpcg.sh --cpu-affinity 3:3:1:1:7:7:5:5 --gpu-affinity 0:1:2:3:4:5:6:7 --cpu-cores-per-rank 8 --dat HPCG.dat

Which gives a performance of:

2x2x2 process grid
256x256x256 local domain
SpMV  = 1563.5 GF (9846.1 GB/s Effective)  195.4 GF_per (1230.8 GB/s Effective)
SymGS = 1697.8 GF (13103.9 GB/s Effective)  212.2 GF_per (1638.0 GB/s Effective)
total = 1571.9 GF (11920.2 GB/s Effective)  196.5 GF_per (1490.0 GB/s Effective)
final = 1414.5 GF (10726.7 GB/s Effective)  176.8 GF_per (1340.8 GB/s Effective)

As above, running the HPL equivalents of these commands results in the MPI_Recv error/verification failure.

For 1 x A100, 239.7 GF is in the ballpark.
However, for 8 x A100, even with the PCIe version, there seems to be a noticeable performance degradation.
Below is the result from an NVLink server. I no longer have access to a PCIe server, but I recall the number being ~240 GF per GPU as well.

2x2x2 process grid
256x256x256 local domain
SpMV  = 1563.5 GF (9845.9 GB/s Effective)  195.4 GF_per (1230.7 GB/s Effective)
SymGS = 2154.8 GF (16630.5 GB/s Effective)  269.3 GF_per (2078.8 GB/s Effective)
total = 1974.8 GF (14975.5 GB/s Effective)  246.8 GF_per (1871.9 GB/s Effective)
final = 1749.0 GF (13263.6 GB/s Effective)  218.6 GF_per (1657.9 GB/s Effective)

Regarding HPL, unfortunately I am out of ideas at the moment.