NVIDIA Docker image nvcr.io/nvidia/hpc-benchmarks:23.10: HPL fails with 2 GPUs on the HPC ARM Developer Kit

I start the container with the following command:
docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/hpc-benchmarks:23.10

Inside the container I change to the HPL directory:
root@931eb37487ed:/workspace# cd /hpl-linux-aarch64-gpu
and run HPL on two GPUs with the following command:
mpirun -n 2 ./hpl-aarch64-gpu.sh --cpu-affinity 0-39:40-79 --gpu-affinity 0:1 --dat ./sample-dat/HPL-2GPUs.dat

but I get the following error:

================================================================================
HPL-NVIDIA 23.10.0 – NVIDIA accelerated HPL benchmark – NVIDIA

HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 136608
NB : 1024
PMAP : Column-major process mapping
P : 2
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter

HPL-NVIDIA settings from environment variables:
monitor_gpu from environment variable 0
warmup_end_prog from environment variable 5.0
test_loops from environment variable 1
hpl_cfg_cuda_vmm from environment variable 0

Device info:
Peak clock frequency 1410 MHz
SM 80
Number of SMs 108
Total memory available 39.39 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 1] Peer GPU 0 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 0] Peer GPU 1 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

[HPL TRACE] cuda_nvshmem_init: max=0.0665 (0) min=0.0648 (1)
[WARNING] Change Input N 136608 to 136192
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 1] Peer GPU 0 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: No such file or directory, exiting… mutex destroy failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/topo/topo.cpp:420: [GPU 0] Peer GPU 1 is not accessible, exiting …
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:843: non-zero status: 3 building transport map failed

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: No such file or directory, exiting… mutex destroy failed


Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[27093,1],1]
Exit code: 255
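
To make the failing output easier to read, here is a minimal Python sketch that groups the NVSHMEM messages and counts the distinct ones. The `LOG` string is an abbreviated excerpt of the output above (source paths shortened to the file name), and the function name `summarize` is only illustrative. One thing it helps show: the "init failed for remote transport: ibrc" line also appears in the successful single-GPU run, so it is probably not the fatal message. The line unique to the two-GPU run is "Peer GPU ... is not accessible".

```python
import re
from collections import Counter

# Abbreviated excerpt of the failing run (paths shortened for readability).
LOG = """\
transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
topo.cpp:420: [GPU 1] Peer GPU 0 is not accessible, exiting
init.cu:843: non-zero status: 3 building transport map failed
transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
topo.cpp:420: [GPU 0] Peer GPU 1 is not accessible, exiting
init.cu:843: non-zero status: 3 building transport map failed
init.cu:nvshmemi_check_state_and_init:933: nvshmem initialization failed, exiting
"""

# Matches "file:line:" or "file:function:line:" and captures the message.
PREFIX = re.compile(r"^\S+?:(?:\w+:)?\d+:\s*(.*)$")


def summarize(log):
    """Count distinct messages, ignoring which GPU reported them."""
    counts = Counter()
    for line in log.splitlines():
        match = PREFIX.match(line.strip())
        if not match:
            continue  # not an NVSHMEM-style diagnostic line
        # Replace concrete GPU indices so rank 0 and rank 1 lines group together.
        message = re.sub(r"GPU \d+", "GPU N", match.group(1))
        counts[message] += 1
    return counts


for message, count in summarize(LOG).most_common():
    print(f"{count}x {message}")
```

If the peer-access message is the real cause, the next thing to look at would be whether the two GPUs can reach each other inside the container, for example with `nvidia-smi topo -m`.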

When I run the HPL benchmark with one GPU, it completes successfully:
root@33b004828267:/workspace/hpl-linux-aarch64-gpu# mpirun -n 1 xhpl ./sample-dat/HPL-1GPU.dat

================================================================================
HPL-NVIDIA 23.10.0 – NVIDIA accelerated HPL benchmark – NVIDIA

HPLinpack 2.1 – High-Performance Linpack benchmark – October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 70000
NB : 200
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter

HPL-NVIDIA settings from environment variables:

Device info:
Peak clock frequency 1410 MHz
SM 80
Number of SMs 108
Total memory available 39.39 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.2/main_nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:215: init failed for remote transport: ibrc
[HPL TRACE] cuda_nvshmem_init: max=1.7898 (0) min=1.7898 (0)
[WARNING] Change Input NB 200 to 192
[WARNING] Change Input N 70000 to 69888
[HPL TRACE] ncclCommInitRank: max=0.0788 (0) min=0.0788 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.1107 (0) min=0.1107 (0)
Per-Process Host Memory Estimate: 0.00 GB (MAX) 0.00 GB (MIN)
Per-Process Device Memory Estimate: 36.70 GB (MAX) 36.70 GB (MIN)
[HPL TRACE] hpl_cfg_cusolver_mp_tests dev_matgen_t: max=0.5158 (0) min=0.5158 (0)

… Testing HPL components …

**** Factorization, m = 69888, policy = 0 ****
avg time = 3.70 ms, avg = 696.59. min = 696.59 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 696.59 GFLOPS

**** Factorization, m = 69888, policy = 1 ****
avg time = 4.34 ms, avg = 594.25. min = 594.25 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 594.25 GFLOPS

**** GEMM - cublas ****
avg time = 6.15 ms, avg = 13400.34. min = 13400.34 [rank 0, host 33b004828267, gpuID 000C:01:00.0], max = 13400.34 GFLOPS

… End of Testing HPL components …

[HPL TRACE] dev_matgen_t: max=0.3126 (0) min=0.3126 (0)
[HPL TRACE] dev_vecgen: max=0.0001 (0) min=0.0001 (0)
2024-01-05 08:48:58.383
Prog= 1.64% N_left= 69504 Time= 0.29 Time_left= 17.14 iGF= 13055.90 GF= 13055.90 iGF_per= 13055.90 GF_per= 13055.90
Prog= 3.26% N_left= 69120 Time= 0.57 Time_left= 16.82 iGF= 13123.36 GF= 13089.36 iGF_per= 13123.36 GF_per= 13089.36
Prog= 4.86% N_left= 68736 Time= 0.85 Time_left= 16.56 iGF= 13044.73 GF= 13074.61 iGF_per= 13044.73 GF_per= 13074.61
Prog= 6.45% N_left= 68352 Time= 1.13 Time_left= 16.41 iGF= 12682.69 GF= 12976.03 iGF_per= 12682.69 GF_per= 12976.03

GF_per= 11989.06 GF_per= 12797.34
Prog= 99.89% N_left= 7296 Time= 17.82 Time_left= 0.02 iGF= 8680.18 GF= 12753.35 iGF_per= 8680.18 GF_per= 12753.35
2024-01-05 08:49:16.335

T/V N NB P Q Time Gflops ( per GPU)

WC0 69888 192 1 1 17.95 1.268e+04 ( 1.268e+04)

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0003266 … PASSED
||Ax-b||_oo . . . . . . . . . . . . . . . . . = 0.0000000007414634
||A||_oo . . . . . . . . . . . . . . . . . . . = 17639.5098762894886022
||x||_oo . . . . . . . . . . . . . . . . . . . = 16.5856498191736499
||b||_oo . . . . . . . . . . . . . . . . . . . = 0.4999926335987157

Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.

End of Tests.
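
As a sanity check on the numbers in the two logs, the short Python sketch below reproduces the adjusted problem sizes and the reported Gflops. It rests on two assumptions: that the "[WARNING] Change Input N" adjustments come from rounding N down to a multiple of NB (which matches both logs, but the logs do not state the rule), and that the Gflops figure uses the conventional HPL operation count of 2/3·N³ + 3/2·N². The function names are illustrative only.

```python
def round_down_to_multiple(n, nb):
    """Largest multiple of nb that does not exceed n (assumed adjustment rule)."""
    return (n // nb) * nb


def hpl_gflops(n, seconds):
    """Gflops from the conventional HPL flop count 2/3*N^3 + 3/2*N^2."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# Two-GPU run: N 136608 with NB 1024 (log reports 136192).
print(round_down_to_multiple(136608, 1024))

# One-GPU run: N 70000 with the adjusted NB 192 (log reports 69888).
print(round_down_to_multiple(70000, 192))

# One-GPU result: N 69888 solved in 17.95 s (log reports 1.268e+04 Gflops).
print(f"{hpl_gflops(69888, 17.95):.3e}")
```

The time in the log is rounded to two decimals, so the computed rate can differ from the reported one in the last digit.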

I have run into the same problem. Have you resolved it?