Run HPL benchmark 23.3 on A800 (80 GB)

Hi, I want to run HPL on an A800. I extracted the xhpl binary from the HPC-Benchmarks 23.3 container image because I don't want to run it inside Docker.
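Roughly, the extraction looked like the sketch below (the image tag and the in-container path of xhpl are from memory, so treat them as assumptions rather than exact values):

docker create --name hpl-tmp nvcr.io/nvidia/hpc-benchmarks:23.3      # create a stopped container from the image
docker cp hpl-tmp:/workspace/hpl-linux-x86_64/xhpl ./xhpl            # copy the benchmark binary out to the host
docker rm hpl-tmp                                                    # remove the temporary container

I then run the extracted xhpl with the following wrapper script: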

#!/bin/bash

export OMP_NUM_THREADS=12
export MKL_NUM_THREADS=12
export MKL_DYNAMIC=FALSE
export TRSM_CUTOFF=600000 

export GPU_DGEMM_SPLIT=1   # GPU/CPU DGEMM work split ratio
lrank=$OMPI_COMM_WORLD_LOCAL_RANK

export CUDA_VISIBLE_DEVICES=$lrank 


echo $lrank

# one rank per GPU: pin even local ranks to NUMA node 0, odd local ranks to NUMA node 1
case $((lrank % 2)) in
    0)
        numactl --cpunodebind=0 --membind=0 \
            ./xhpl
        ;;
    1)
        numactl --cpunodebind=1 --membind=1 \
            ./xhpl
        ;;
esac
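The even/odd case above is meant to give one MPI rank per GPU, each pinned to its local NUMA node, so a full two-GPU launch would look roughly like this (the --bind-to none flag is only my assumption, added so that mpirun does not override the numactl binding):

mpirun -np 2 --bind-to none ./one_gpu.sh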

For this test I run the above script with a single rank, mpirun -np 1 ./one_gpu.sh, but I get the following error:

--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: gpu1
--------------------------------------------------------------------------

================================================================================
HPL-NVIDIA 23.3.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   92800 
NB     :    1024 
PMAP   : Column-major process mapping
P      :       1 
Q      :       1 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  2ringM 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : transposed form
EQUIL  : no
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

gpu_dgemm_split from environment variable 1.000 
using in-house GEMM implementation
hpl_cfg_cusolver_mp_tests = 1 
hpl_cfg_chunk_size_nbs = 16
hpl_cfg_p2p_policy = 1 (0 -> ncclBcast, 1 -> ncclSend/Recv, 2 -> CUDA-aware MPI, 3 -> host MPI, 4 -> NVSHMEM)
hpl_cfg_fct_comm_policy = 0 (0 -> nvshmem (default), 1 -> host MPI)
hpl_cfg_fct_min_gpux = 8
hpl_cfg_fct_switch_n = 32768
hpl_cfg_cta_per_fct = 16
hpl_cfg_dist_trsm_flag = 1
hpl_cfg_use_nvshmem = 1
hpl_cfg_debug = 0
Device info:
        Peak clock frequency 1410 MHz
        SM 80
        Number of SMs 108
        Total memory available 79.18 GB
        canUseHostPointerForRegisteredMem 1
        canMapHostMemory 1
[HPL TRACE] cuda_nvshmem_init: max=2.6062 (0) min=2.6062 (0)
[WARNING] Change Input N  = 92800 to 92160
[HPL TRACE] ncclCommInitRank: max=0.0563 (0) min=0.0563 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.0722 (0) min=0.0722 (0)

Per-Process   Host Memory Estimate: 0.00 GB (MAX) 0.00 GB (MIN)

Per-Process Device Memory Estimate: 66.13 GB (MAX) 66.13 GB (MIN)
[HPL TRACE] hpl_cfg_cusolver_mp_tests dev_matgen_t: max=0.4786 (0) min=0.4786 (0)

 ... Testing HPL components ... 

 **** Factorization, m = 92160, policy = 0 **** 
avg time =    38.19 ms, avg =  2530.43 (min  2530.43, max  2530.43) GFLOPS

 **** Factorization, m = 92160, policy = 1 **** 
avg time =    42.24 ms, avg =  2287.63 (min  2287.63, max  2287.63) GFLOPS

 **** Factorization, m = 46080, policy = 0 **** 
avg time =    24.98 ms, avg =  1934.10 (min  1934.10, max  1934.10) GFLOPS

 **** Factorization, m = 46080, policy = 1 **** 
avg time =    31.15 ms, avg =  1551.13 (min  1551.13, max  1551.13) GFLOPS

 **** Factorization, m = 22528, policy = 0 **** 
avg time =    19.56 ms, avg =  1207.93 (min  1207.93, max  1207.93) GFLOPS

 **** Factorization, m = 22528, policy = 1 **** 
avg time =    25.97 ms, avg =   909.64 (min   909.64, max   909.64) GFLOPS

 **** Factorization, m = 1024, policy = 0 **** 
avg time =    14.44 ms, avg =    74.34 (min    74.34, max    74.34) GFLOPS

 **** Factorization, m = 1024, policy = 1 **** 
avg time =    20.77 ms, avg =    51.69 (min    51.69, max    51.69) GFLOPS

 **** ncclBcast( Row ) **** 
avg time =     0.00 ms, avg = 1011081.72 (min 1011081.72, max 1011081.72) GBS

 **** ncclAllGather( Col ) **** 
avg time =     0.00 ms, avg = 390167.81 (min 390167.81, max 390167.81) GBS

 **** Latency ncclAllGather, m = 1 **** 
avg time =     0.12 ms, avg =     0.07 (min     0.07, max     0.07) GBS

 **** Latency ncclAllGather, m = 2 **** 
avg time =     0.12 ms, avg =     0.14 (min     0.14, max     0.14) GBS

 **** Latency ncclAllGather, m = 32 **** 
avg time =     0.12 ms, avg =     2.20 (min     2.20, max     2.20) GBS

 **** Latency ncclAllGather, m = 1024 **** 
avg time =     0.12 ms, avg =    71.13 (min    71.13, max    71.13) GBS

 **** Latency ncclAllGather, m = 2048 **** 
avg time =     0.12 ms, avg =   136.47 (min   136.47, max   136.47) GBS

 **** Latency Host MPI_Allgather, m = 1 **** 
avg time =     0.02 ms, avg =     0.42 (min     0.42, max     0.42) GBS

 **** Latency Host MPI_Allgather, m = 2 **** 
avg time =     0.02 ms, avg =     0.81 (min     0.81, max     0.81) GBS

 **** Latency Host MPI_Allgather, m = 32 **** 
avg time =     0.02 ms, avg =    13.00 (min    13.00, max    13.00) GBS

 **** Latency Host MPI_Allgather, m = 1024 **** 
avg time =     0.02 ms, avg =   431.63 (min   431.63, max   431.63) GBS

 **** Latency Host MPI_Allgather, m = 2048 **** 
avg time =     0.02 ms, avg =   848.44 (min   848.44, max   848.44) GBS

 **** Latency Dev MPI_Allgather, m = 1 **** 
[gpu1:200795:0:200795] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x100625b2800)
==== backtrace (tid: 200795) ====
 0  /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(ucs_handle_error+0x294) [0x150885c84434]
 1  /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(+0x2f5ec) [0x150885c845ec]
 2  /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(+0x2f898) [0x150885c84898]
 3  /lib64/libc.so.6(+0xceeb7) [0x15089c177eb7]
 4  /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libopen-pal.so.40(+0x4b230) [0x15089bb35230]
 5  /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(ompi_datatype_sndrcv+0x2c8) [0x15089d22d188]
 6  /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(MPI_Allgather+0x143) [0x15089d22df33]
 7  ./xhpl(+0x30cee) [0x562b5d3e1cee]
 8  ./xhpl(+0xe0dd5) [0x562b5d491dd5]
 9  ./xhpl(+0x1e71b) [0x562b5d3cf71b]
10  /lib64/libc.so.6(__libc_start_main+0xe5) [0x15089c0e3d85]
11  ./xhpl(+0x1f70e) [0x562b5d3d070e]
=================================
[gpu1:200795] *** Process received signal ***
[gpu1:200795] Signal: Segmentation fault (11)
[gpu1:200795] Signal code:  (-6)
[gpu1:200795] Failing at address: 0x3eb0003105b
[gpu1:200795] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x15089ca2dcf0]
[gpu1:200795] [ 1] /lib64/libc.so.6(+0xceeb7)[0x15089c177eb7]
[gpu1:200795] [ 2] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libopen-pal.so.40(+0x4b230)[0x15089bb35230]
[gpu1:200795] [ 3] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(ompi_datatype_sndrcv+0x2c8)[0x15089d22d188]
[gpu1:200795] [ 4] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(MPI_Allgather+0x143)[0x15089d22df33]
[gpu1:200795] [ 5] ./xhpl(+0x30cee)[0x562b5d3e1cee]
[gpu1:200795] [ 6] ./xhpl(+0xe0dd5)[0x562b5d491dd5]
[gpu1:200795] [ 7] ./xhpl(+0x1e71b)[0x562b5d3cf71b]
[gpu1:200795] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x15089c0e3d85]
[gpu1:200795] [ 9] ./xhpl(+0x1f70e)[0x562b5d3d070e]
[gpu1:200795] *** End of error message ***
./one_gpu.sh: line 16: 200795 Segmentation fault      (core dumped) numactl --cpunodebind=0 --membind=0 ./xhpl
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23975,1],0]
  Exit code:    139

What’s more, I can run this successfully with the 21.4 release of the benchmark. The packages I use include the following (a quick way to check what is picked up at run time is sketched after the list):

nccl 
nvshmem-2.9.0
openmpi-4.1.4 (built with UCX support)
ucx-1.13.1
cuda-12.0
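
For completeness, a quick way to confirm which of these libraries the extracted xhpl actually resolves against at run time (standard tools only, nothing specific to the benchmark):

ldd ./xhpl | grep -Ei 'mpi|ucx|nccl|nvshmem'   # which MPI/UCX/NCCL/NVSHMEM shared libraries the binary picks up
mpirun --version                               # Open MPI version
ucx_info -v                                    # UCX version and build configuration
nvcc --version                                 # CUDA toolkit version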

Thanks!