Run HPL on 4x A100

I want to run HPL on a standalone machine with two 64-core AMD EPYC CPUs and 4x A100 GPUs:

[root@epyc hpl]# lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
41:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
81:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)
c1:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100X] (rev a1)

All four A100s are visible in Docker:

[root@epyc hpl]# docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sat Jul 17 12:58:58 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   22C    P0    49W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:41:00.0 Off |                    0 |
| N/A   21C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   22C    P0    49W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
| N/A   22C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I use the following script to run HPL:
[root@epyc hpl]# cat docker-run.sh
CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'

docker run --gpus all --security-opt seccomp=seccomp.json \
    ${CONT} \
    mpirun --bind-to none -np 8 \
    hpl.sh --config dgx-a100 --dat /workspace/hpl-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

When I run it, I get the following errors:

[root@epyc hpl]# ./docker-run.sh
WARNING: No InfiniBand devices detected.
Multi-node communication performance may be reduced.

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.

INFO: host=a8369df39cce rank=0 lrank=0 cores=16 gpu=0 cpu=32-47 mem=2 net=mlx5_0:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=2 lrank=2 cores=16 gpu=2 cpu=0-15 mem=0 net=mlx5_2:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=4 lrank=4 cores=16 gpu=4 cpu=96-111 mem=6 net=mlx5_6:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=6 lrank=6 cores=16 gpu=6 cpu=64-79 mem=4 net=mlx5_8:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=5 lrank=5 cores=16 gpu=5 cpu=112-127 mem=7 net=mlx5_7:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=1 lrank=1 cores=16 gpu=1 cpu=48-63 mem=3 net=mlx5_1:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=7 lrank=7 cores=16 gpu=7 cpu=80-95 mem=5 net=mlx5_9:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=a8369df39cce rank=3 lrank=3 cores=16 gpu=3 cpu=16-31 mem=1 net=mlx5_3:1 bin=/workspace/hpl-linux-x86_64/xhpl
[1626526959.279421] [a8369df39cce:76 :0] ucp_context.c:775 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.362147] [a8369df39cce:127 :0] ucp_context.c:775 UCX WARN network device 'mlx5_9:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.386758] [a8369df39cce:121 :0] ucp_context.c:775 UCX WARN network device 'mlx5_8:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.409308] [a8369df39cce:85 :0] ucp_context.c:775 UCX WARN network device 'mlx5_2:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.410425] [a8369df39cce:118 :0] ucp_context.c:775 UCX WARN network device 'mlx5_6:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.542911] [a8369df39cce:125 :0] ucp_context.c:775 UCX WARN network device 'mlx5_7:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.588440] [a8369df39cce:126 :0] ucp_context.c:775 UCX WARN network device 'mlx5_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)
[1626526959.676905] [a8369df39cce:129 :0] ucp_context.c:775 UCX WARN network device 'mlx5_3:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp)

================================================================================
HPL-NVIDIA 1.0.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA

HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 200960
NB : 288
PMAP : Row-major process mapping
P : 4
Q : 2
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

trsm_cutoff from environment variable 9000000
gpu_dgemm_split from environment variable 1.000
monitor_gpu from environment variable 1
gpu_temp_warning from environment variable 78
gpu_clock_warning from environment variable 1410
gpu_power_warning from environment variable 400
max_h2d_ms from environment variable 200
max_d2h_ms from environment variable 200
gpu_pcie_gen_warning from environment variable 3
gpu_pcie_width_warning from environment variable 2
test_loops from environment variable 1
test_system from environment variable 1
rank 4 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 6 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 7 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
rank 5 on host a8369df39cce : NO GPU AVAILABLE, SETTING GPU_DGEMM_SPLIT=0.0
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)
CUDART: cudaGetDevice(&gpuid) = 100 (no CUDA-capable device is detected) at (…/HPL_pddriver.c:448)

    ******** TESTING SYSTEM PARAMETERS ********
    PARAM   [UNITS]         MIN     MAX     AVG
    -----   -------         ---     ---     ---

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[49809,1],5]
Exit code: 1

[root@epyc hpl]#

What is wrong when I run HPL?

P.S. I use the seccomp profile from this gist: Allowing numactl in docker container · GitHub
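
For context, my understanding is that the gist simply extends Docker's default seccomp profile with the NUMA-related syscalls that numactl needs. A rough sketch of the idea (the syscall list here is my assumption, not copied from the gist; assumes curl and jq are available):

# Sketch only: take Docker's stock seccomp profile and allow the NUMA
# syscalls that numactl relies on. The resulting file is what
# --security-opt seccomp=seccomp.json points at.
curl -sSL https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json |
  jq '.syscalls += [{
        "names": ["set_mempolicy", "mbind", "migrate_pages", "move_pages"],
        "action": "SCMP_ACT_ALLOW"
      }]' > seccomp.json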

I also tried running HPL via Pyxis/Enroot, but it does not work either:

[root@epyc hpl]# enroot start nvidia+hpc-benchmarks+21.4-hpl
nvidia-container-cli: mount error: file creation failed: /root/.local/share/enroot/nvidia+hpc-benchmarks+21.4-hpl/run/nvidia-persistenced/socket: no such device or address
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

When running HPL, use one MPI rank per GPU. You will probably also need to change the problem size to match 4 GPUs rather than 8 (which is what the container's sample input is set up for), and change the PxQ process grid, probably to 2x2. The warnings about InfiniBand can be safely ignored on a single machine.
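
Something along these lines should be closer (a sketch, untested on your box; HPL-4gpu.dat is a hypothetical copy of the container's sample HPL-dgx-a100-1N.dat with N and the process grid edited, and the suggested N is only a rough estimate for 4x 40 GB GPUs):

CONT='nvcr.io/nvidia/hpc-benchmarks:21.4-hpl'

# HPL-4gpu.dat: a copy of the sample HPL-dgx-a100-1N.dat with these lines
# changed for 4 GPUs instead of 8 (values are rough assumptions):
#   142848       Ns    <- roughly sqrt(4/8) * 200960, rounded to a multiple of NB=288
#   2            Ps
#   2            Qs
#
# Mount the edited input file into the container and start one MPI rank per GPU:
docker run --gpus all --security-opt seccomp=seccomp.json \
    -v $(pwd)/HPL-4gpu.dat:/workspace/HPL-4gpu.dat \
    ${CONT} \
    mpirun --bind-to none -np 4 \
    hpl.sh --config dgx-a100 --dat /workspace/HPL-4gpu.dat

The dgx-a100 affinity map assumes a DGX A100 layout, so the CPU pinning may not be ideal for this EPYC board, but with -np 4 it should at least place one rank on each of the four GPUs.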