Run HPL on 4x A100

I want to run HPL on a standalone machine with two 64-core AMD EPYC CPUs and 4x A100 GPUs:

[root@epyc hpl]# lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
41:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)

All four A100s are visible in Docker:

[root@epyc hpl]# docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sat Jul 17 12:58:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   22C    P0    49W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   21C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   22C    P0    49W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   22C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I use the following script to run HPL:
[root@epyc hpl]# cat docker-run.sh
CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'

docker run --gpus all --security-opt seccomp=seccomp.json \
    ${CONT} \
    mpirun --bind-to none -np 8 \
    hpl.sh --config dgx-a100 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

When I run it, I get the following errors:

[root@epyc hpl]# ./docker-run.sh
WARNING: No InfiniBand devices detected.
Multi-node communication performance may be reduced.

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

INFO: host=a8369df39cce rank=0 lrank=0 cores=16 gpu=0 cpu=32-47 mem=2 net=mlx5_0:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=2 lrank=2 cores=16 gpu=2 cpu=0-15 mem=0 net=mlx5_2:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=4 lrank=4 cores=16 gpu=4 cpu=96-111 mem=6 net=mlx5_6:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=6 lrank=6 cores=16 gpu=6 cpu=64-79 mem=4 net=mlx5_8:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=5 lrank=5 cores=16 gpu=5 cpu=112-127 mem=7 net=mlx5_7:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=1 lrank=1 cores=16 gpu=1 cpu=48-63 mem=3 net=mlx5_1:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=7 lrank=7 cores=16 gpu=7 cpu=80-95 mem=5 net=mlx5_9:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=3 lrank=3 cores=16 gpu=3 cpu=16-31 mem=1 net=mlx5_3:1 bin=/workspace/hpl-linux-x86_64/xhpl
[1626526959.279421] [a8369df39cce:76 :0] ucp_context.c:775 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.362147] [a8369df39cce:127 :0] ucp_context.c:775 UCX WARN network device 'mlx5_9:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.386758] [a8369df39cce:121 :0] ucp_context.c:775 UCX WARN network device 'mlx5_8:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.409308] [a8369df39cce:85 :0] ucp_context.c:775 UCX WARN network device 'mlx5_2:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.410425] [a8369df39cce:118 :0] ucp_context.c:775 UCX WARN network device 'mlx5_6:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.542911] [a8369df39cce:125 :0] ucp_context.c:775 UCX WARN network device 'mlx5_7:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.588440] [a8369df39cce:126 :0] ucp_context.c:775 UCX WARN network device 'mlx5_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.676905] [a8369df39cce:129 :0] ucp_context.c:775 UCX WARN network device 'mlx5_3:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)

================================================================================
HPL-NVIDIA 1.0.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA

HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 200960
NB : 288
PMAP : Row-major process mapping
P : 4
Q : 2
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

trsm_cutoff from environment variable 9000000
gpu_dgemm_split from environment variable 1.000
monitor_gpu from environment variable 1
gpu_temp_warning from environment variable 78
gpu_clock_warning from environment variable 1410
gpu_power_warning from environment variable 400
max_h2d_ms from environment variable 200
max_d2h_ms from environment variable 200
gpu_pcie_gen_warning from environment variable 3
gpu_pcie_width_warning from environment variable 2
test_loops from environment variable 1
test_system from environment variable 1
rank 4 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 6 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 7 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 5 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)

    ******** TESTING SYSTEM PARAMETERS ********
    PARAM   [UNITS]         MIN     MAX     AVG
    -----   -------         ---     ---     ---

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[49809,1],5]
Exit code: 1

[root@epyc hpl]#

What is wrong when I run HPL?

P.S. I use the seccomp profile from this gist: Allowing numactl in docker container · GitHub
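
For context, my understanding is that the gist simply extends Docker's default seccomp profile with the NUMA-related syscalls that numactl needs. A rough sketch of the idea (the syscall list here is my assumption, not copied from the gist; assumes curl and jq are available):

# Sketch only: take Docker's stock seccomp profile and allow the NUMA
# syscalls that numactl relies on. The resulting file is what
# --security-opt seccomp=seccomp.json points at.
curl -sSL https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json |
  jq '.syscalls += [{
        "names": ["set_mempolicy", "mbind", "migrate_pages", "move_pages"],
        "action": "SCMP_ACT_ALLOW"
      }]' > seccomp.json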

I also tried running HPL via Pyxis/Enroot, but it does not work either:

[root@epyc hpl]# enroot start nvidia+hpc-benchmarks+21.4-hpl
nvidia-container-cli: mount error: file creation failed: /root/.local/share/enroot/nvidia+hpc-benchmarks+21.4-hpl/run/nvidia-persistenced/socket: no such device or address
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

When running HPL, use one MPI rank per GPU. You will probably also need to change the problem size to match 4 GPUs rather than 8 (which is what the container's sample input is set up for), and change the PxQ process grid, probably to 2x2. The warnings about InfiniBand can be safely ignored on a single machine.
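
Something along these lines should be closer (a sketch, untested on your box; HPL-4gpu.dat is a hypothetical copy of the container's sample HPL-dgx-a100-1N.dat with N and the process grid edited, and the suggested N is only a rough estimate for 4x 40 GB GPUs):

CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'

# HPL-4gpu.dat: a copy of the sample HPL-dgx-a100-1N.dat with these lines
# changed for 4 GPUs instead of 8 (values are rough assumptions):
#   142848       Ns    <- roughly sqrt(4/8) * 200960, rounded to a multiple of NB=288
#   2            Ps
#   2            Qs
#
# Mount the edited input file into the container and start one MPI rank per GPU:
docker run --gpus all --security-opt seccomp=seccomp.json \
    -v $(pwd)/HPL-4gpu.dat:/workspace/HPL-4gpu.dat \
    ${CONT} \
    mpirun --bind-to none -np 4 \
    hpl.sh --config dgx-a100 --dat /workspace/HPL-4gpu.dat

The dgx-a100 affinity map assumes a DGX A100 layout, so the CPU pinning may not be ideal for this EPYC board, but with -np 4 it should at least place one rank on each of the four GPUs.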