NVIDIA Docker image nvcr.io/nvidia/hpc-benchmarks:23.10: HPL run fails on HPC ARM Developer Kit

user102047 · December 21, 2023, 1:56am

I start the container with the following command:
docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/hpc-benchmarks:23.10

Inside the container I change to the HPL directory:
root@931eb37487ed:/workspace# cd /hpl-linux-aarch64-gpu
and run HPL with the following command:
mpirun -n 2 ./hpl-aarch64-gpu.sh --cpu-affinity 0-39:40-79 --gpu-affinity 0:1 --dat ./sample-dat/HPL-2GPUs.dat

but I get the following error:

================================================================================
HPL-NVIDIA 23.10.0 – NVIDIA accelerated HPL benchmark – NVIDIA

HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 136608
NB : 1024
PMAP : Column-major process mapping
P : 2
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words

The matrix A is randomly generated for each test.
The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
The relative machine precision (eps) is taken to be 1.110223e-16
Computational tests pass if scaled residuals are less than 16.0

HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter

HPL-NVIDIA settings from environment variables:
monitor_gpu from environment variable 0
warmup_end_prog from environment variable 5.0
test_loops from environment variable 1
hpl_cfg_cuda_vmm from environment variable 0

Device info:
Peak clock frequency 1410 MHz
SM 80
Number of SMs 108
Total memory available 39.39 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 1] Peer GPU 0 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 0] Peer GPU 1 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

[HPL TRACE] cuda_nvshmem_init: max=0.0665 (0) min=0.0648 (1)
[WARNING] Change Input N 136608 to 136192
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 1] Peer GPU 0 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: No such file or directory, exiting… mutex destroy failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 0] Peer GPU 1 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: No such file or directory, exiting… mutex destroy failed

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[27093,1],1]
Exit code: 255
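
A minimal sketch for reading the log above, assuming it has been saved to a file (the name hpl-2gpu.log is hypothetical, and the function name is illustrative, not part of the container). It counts two NVSHMEM messages separately: the ibrc remote-transport failure, which also shows up in the successful single-GPU run later in this thread and so does not by itself look fatal, and the "Peer GPU N is not accessible" message, which appears only in the failing two-GPU run.

```shell
#!/bin/sh
# Summarize NVSHMEM messages in a saved HPL log.
# Usage: summarize_nvshmem_log hpl-2gpu.log   (file name is hypothetical)
summarize_nvshmem_log() {
    log_file=$1

    # Remote-transport failures; the single-GPU run prints this too and still passes.
    ibrc=$(grep -c 'init failed for remote transport: ibrc' "$log_file")

    # Peer-access failures, the message that precedes "nvshmem initialization failed".
    peer=$(grep -c 'Peer GPU [0-9]* is not accessible' "$log_file")

    echo "ibrc_transport_failures=$ibrc"
    echo "peer_access_failures=$peer"

    # List each distinct (GPU, peer) pair once.
    grep -o '\[GPU [0-9]*\] Peer GPU [0-9]* is not accessible' "$log_file" | sort -u
}
```

For the log above, this would list both directions, [GPU 0] to peer 1 and [GPU 1] to peer 0, which suggests that the two GPUs cannot reach each other directly from inside the container.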

user102047 · January 5, 2024, 8:52am

When I run the HPL benchmark with a single GPU, it completes successfully:
root@33b004828267:/workspace/hpl-linux-aarch64-gpu# mpirun -n 1 xhpl ./sample-dat/HPL-1GPU.dat

================================================================================
HPL-NVIDIA 23.10.0 – NVIDIA accelerated HPL benchmark – NVIDIA

HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 70000
NB : 200
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words

The matrix A is randomly generated for each test.
The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
The relative machine precision (eps) is taken to be 1.110223e-16
Computational tests pass if scaled residuals are less than 16.0

HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter

HPL-NVIDIA settings from environment variables:

Device info:
Peak clock frequency 1410 MHz
SM 80
Number of SMs 108
Total memory available 39.39 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
[HPL TRACE] cuda_nvshmem_init: max=1.7898 (0) min=1.7898 (0)
[WARNING] Change Input NB 200 to 192
[WARNING] Change Input N 70000 to 69888
[HPL TRACE] ncclCommInitRank: max=0.0788 (0) min=0.0788 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.1107 (0) min=0.1107 (0)
Per-Process Host Memory Estimate: 0.00 GB (MAX) 0.00 GB (MIN)
Per-Process Device Memory Estimate: 36.70 GB (MAX) 36.70 GB (MIN)
[HPL TRACE] hpl_cfg_cusolver_mp_tests dev_matgen_t: max=0.5158 (0) min=0.5158 (0)

… Testing HPL components …

**** Factorization, m = 69888, policy = 0 ****
avg time = 3.70 ms, avg = 696.59. min = 696.59 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 696.59 GFLOPS

**** Factorization, m = 69888, policy = 1 ****
avg time = 4.34 ms, avg = 594.25. min = 594.25 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 594.25 GFLOPS
…

**** GEMM - cublas ****
avg time = 6.15 ms, avg = 13400.34. min = 13400.34 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 13400.34 GFLOPS

… End of Testing HPL components …

[HPL TRACE] dev_matgen_t: max=0.3126 (0) min=0.3126 (0)
[HPL TRACE] dev_vecgen: max=0.0001 (0) min=0.0001 (0)
2024-01-05 08:48:58.383
Prog= 1.64% N_left= 69504 Time= 0.29 Time_left= 17.14 iGF= 13055.90 GF= 13055.90 iGF_per= 13055.90 GF_per= 13055.90
Prog= 3.26% N_left= 69120 Time= 0.57 Time_left= 16.82 iGF= 13123.36 GF= 13089.36 iGF_per= 13123.36 GF_per= 13089.36
Prog= 4.86% N_left= 68736 Time= 0.85 Time_left= 16.56 iGF= 13044.73 GF= 13074.61 iGF_per= 13044.73 GF_per= 13074.61
Prog= 6.45% N_left= 68352 Time= 1.13 Time_left= 16.41 iGF= 12682.69 GF= 12976.03 iGF_per= 12682.69 GF_per= 12976.03
…

GF_per= 11989.06 GF_per= 12797.34
Prog= 99.89% N_left= 7296 Time= 17.82 Time_left= 0.02 iGF= 8680.18 GF= 12753.35 iGF_per= 8680.18 GF_per= 12753.35
2024-01-05 08:49:16.335

T/V N NB P Q Time Gflops ( per GPU)

WC0 69888 192 1 1 17.95 1.268e+04 ( 1.268e+04)

||Ax-b||_oo/(eps(||A||_oo||x||_oo+||b||_oo)*N)= 0.0003266 … PASSED
||Ax-b||_oo . . . . . . . . . . . . . . . . . = 0.0000000007414634
||A||_oo . . . . . . . . . . . . . . . . . . . = 17639.5098762894886022
||x||_oo . . . . . . . . . . . . . . . . . . . = 16.5856498191736499
||b||_oo . . . . . . . . . . . . . . . . . . . = 0.4999926335987157

Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.

End of Tests.
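
The numbers in this log can be cross-checked by hand. The sketch below makes three assumptions that the log does not state: that the adjusted N is the input N rounded down to a multiple of NB (this matches both logs, but it does not explain why NB itself went from 200 to 192), that the Gflops figure uses the conventional HPL operation count 2/3*N^3 + 3/2*N^2, and that the norms printed above are enough to reproduce the scaled residual. The function names are illustrative only.

```shell
#!/bin/sh
# Largest multiple of nb that does not exceed n (integer arithmetic).
round_down_to_multiple() {
    n=$1; nb=$2
    echo $(( n / nb * nb ))
}

# GFLOPS from the conventional HPL operation count 2/3*N^3 + 3/2*N^2
# (assumed here; the log itself does not print the formula).
hpl_gflops() {
    awk -v n="$1" -v t="$2" \
        'BEGIN { printf "%.3e\n", (2.0/3.0*n*n*n + 1.5*n*n) / t / 1e9 }'
}

# Scaled residual ||Ax-b|| / (eps * (||A||*||x|| + ||b||) * N), infinity norms.
# Arguments: ||Ax-b||  ||A||  ||x||  ||b||  N  eps
hpl_scaled_residual() {
    awk -v r="$1" -v a="$2" -v x="$3" -v b="$4" -v n="$5" -v eps="$6" \
        'BEGIN { printf "%.7f\n", r / (eps * (a * x + b) * n) }'
}

round_down_to_multiple 136608 1024   # two-GPU run: prints 136192
round_down_to_multiple 70000 192     # this run:    prints 69888
hpl_gflops 69888 17.95               # prints 1.268e+04
hpl_scaled_residual 0.0000000007414634 17639.5098762894886022 \
    16.5856498191736499 0.4999926335987157 69888 1.110223e-16   # prints 0.0003266
```

All four values agree with the log, so the single-GPU result is internally consistent: about 12.7 TFLOPS with a residual far below the 16.0 pass threshold.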

Sala · February 22, 2024, 1:58pm

I have run into the same problem. Have you resolved it?
