Hi, I want to run HPL on an A800. I extracted the xhpl binary from the HPC-Benchmark 23.3 container image because I don't want to run it inside Docker.
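For context, this is roughly how I pulled the binary out of the container (the image tag and in-container path below are from memory and may not be exact):

docker pull nvcr.io/nvidia/hpc-benchmarks:23.3
cid=$(docker create nvcr.io/nvidia/hpc-benchmarks:23.3)
docker cp "$cid":/workspace/hpl-linux-x86_64 .   # directory containing xhpl (path assumed)
docker rm "$cid"

I run it with the following script: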
#!/bin/bash
export OMP_NUM_THREADS=12
export MKL_NUM_THREADS=12
export MKL_DYNAMIC=FALSE
export TRSM_CUTOFF=600000
export GPU_DGEMM_SPLIT=1   # GPU share of the DGEMM work (1 = all on the GPU)
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
export CUDA_VISIBLE_DEVICES=$lrank
echo $lrank
# Bind even local ranks to NUMA node 0 and odd ranks to NUMA node 1
case $((lrank % 2)) in
0)
    numactl --cpunodebind=0 --membind=0 \
        ./xhpl
    ;;
1)
    numactl --cpunodebind=1 --membind=1 \
        ./xhpl
    ;;
esac
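(A note on the design: the case on $((lrank % 2)) is meant to bind each local rank to one of the two NUMA nodes, even ranks to node 0 and odd ranks to node 1, so a multi-GPU run would be launched with one rank per GPU, something like

mpirun -np 2 --bind-to none ./one_gpu.sh

where --bind-to none keeps mpirun's own core binding from fighting with numactl. That multi-rank launch is only a sketch, not what I ran here.)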
I run the above script with the command mpirun -np 1 ./one_gpu.sh, but I get the following error:
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu1
--------------------------------------------------------------------------
================================================================================
HPL-NVIDIA 23.3.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 92800
NB : 1024
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
gpu_dgemm_split from environment variable 1.000
using in-house GEMM implementation
hpl_cfg_cusolver_mp_tests = 1
hpl_cfg_chunk_size_nbs = 16
hpl_cfg_p2p_policy = 1 (0 -> ncclBcast, 1 -> ncclSend/Recv, 2 -> CUDA-aware MPI, 3 -> host MPI, 4 -> NVSHMEM)
hpl_cfg_fct_comm_policy = 0 (0 -> nvshmem (default), 1 -> host MPI)
hpl_cfg_fct_min_gpux = 8
hpl_cfg_fct_switch_n = 32768
hpl_cfg_cta_per_fct = 16
hpl_cfg_dist_trsm_flag = 1
hpl_cfg_use_nvshmem = 1
hpl_cfg_debug = 0
Device info:
Peak clock frequency 1410 MHz
SM 80
Number of SMs 108
Total memory available 79.18 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
[HPL TRACE] cuda_nvshmem_init: max=2.6062 (0) min=2.6062 (0)
[WARNING] Change Input N = 92800 to 92160
[HPL TRACE] ncclCommInitRank: max=0.0563 (0) min=0.0563 (0)
[HPL TRACE] cugetrfs_mp_init: max=0.0722 (0) min=0.0722 (0)
Per-Process Host Memory Estimate: 0.00 GB (MAX) 0.00 GB (MIN)
Per-Process Device Memory Estimate: 66.13 GB (MAX) 66.13 GB (MIN)
[HPL TRACE] hpl_cfg_cusolver_mp_tests dev_matgen_t: max=0.4786 (0) min=0.4786 (0)
... Testing HPL components ...
**** Factorization, m = 92160, policy = 0 ****
avg time = 38.19 ms, avg = 2530.43 (min 2530.43, max 2530.43) GFLOPS
**** Factorization, m = 92160, policy = 1 ****
avg time = 42.24 ms, avg = 2287.63 (min 2287.63, max 2287.63) GFLOPS
**** Factorization, m = 46080, policy = 0 ****
avg time = 24.98 ms, avg = 1934.10 (min 1934.10, max 1934.10) GFLOPS
**** Factorization, m = 46080, policy = 1 ****
avg time = 31.15 ms, avg = 1551.13 (min 1551.13, max 1551.13) GFLOPS
**** Factorization, m = 22528, policy = 0 ****
avg time = 19.56 ms, avg = 1207.93 (min 1207.93, max 1207.93) GFLOPS
**** Factorization, m = 22528, policy = 1 ****
avg time = 25.97 ms, avg = 909.64 (min 909.64, max 909.64) GFLOPS
**** Factorization, m = 1024, policy = 0 ****
avg time = 14.44 ms, avg = 74.34 (min 74.34, max 74.34) GFLOPS
**** Factorization, m = 1024, policy = 1 ****
avg time = 20.77 ms, avg = 51.69 (min 51.69, max 51.69) GFLOPS
**** ncclBcast( Row ) ****
avg time = 0.00 ms, avg = 1011081.72 (min 1011081.72, max 1011081.72) GBS
**** ncclAllGather( Col ) ****
avg time = 0.00 ms, avg = 390167.81 (min 390167.81, max 390167.81) GBS
**** Latency ncclAllGather, m = 1 ****
avg time = 0.12 ms, avg = 0.07 (min 0.07, max 0.07) GBS
**** Latency ncclAllGather, m = 2 ****
avg time = 0.12 ms, avg = 0.14 (min 0.14, max 0.14) GBS
**** Latency ncclAllGather, m = 32 ****
avg time = 0.12 ms, avg = 2.20 (min 2.20, max 2.20) GBS
**** Latency ncclAllGather, m = 1024 ****
avg time = 0.12 ms, avg = 71.13 (min 71.13, max 71.13) GBS
**** Latency ncclAllGather, m = 2048 ****
avg time = 0.12 ms, avg = 136.47 (min 136.47, max 136.47) GBS
**** Latency Host MPI_Allgather, m = 1 ****
avg time = 0.02 ms, avg = 0.42 (min 0.42, max 0.42) GBS
**** Latency Host MPI_Allgather, m = 2 ****
avg time = 0.02 ms, avg = 0.81 (min 0.81, max 0.81) GBS
**** Latency Host MPI_Allgather, m = 32 ****
avg time = 0.02 ms, avg = 13.00 (min 13.00, max 13.00) GBS
**** Latency Host MPI_Allgather, m = 1024 ****
avg time = 0.02 ms, avg = 431.63 (min 431.63, max 431.63) GBS
**** Latency Host MPI_Allgather, m = 2048 ****
avg time = 0.02 ms, avg = 848.44 (min 848.44, max 848.44) GBS
**** Latency Dev MPI_Allgather, m = 1 ****
[gpu1:200795:0:200795] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x100625b2800)
==== backtrace (tid: 200795) ====
0 /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(ucs_handle_error+0x294) [0x150885c84434]
1 /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(+0x2f5ec) [0x150885c845ec]
2 /mnt/home/benchmark/local_apps/ucx-1.13.1/lib/libucs.so.0(+0x2f898) [0x150885c84898]
3 /lib64/libc.so.6(+0xceeb7) [0x15089c177eb7]
4 /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libopen-pal.so.40(+0x4b230) [0x15089bb35230]
5 /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(ompi_datatype_sndrcv+0x2c8) [0x15089d22d188]
6 /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(MPI_Allgather+0x143) [0x15089d22df33]
7 ./xhpl(+0x30cee) [0x562b5d3e1cee]
8 ./xhpl(+0xe0dd5) [0x562b5d491dd5]
9 ./xhpl(+0x1e71b) [0x562b5d3cf71b]
10 /lib64/libc.so.6(__libc_start_main+0xe5) [0x15089c0e3d85]
11 ./xhpl(+0x1f70e) [0x562b5d3d070e]
=================================
[gpu1:200795] *** Process received signal ***
[gpu1:200795] Signal: Segmentation fault (11)
[gpu1:200795] Signal code: (-6)
[gpu1:200795] Failing at address: 0x3eb0003105b
[gpu1:200795] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x15089ca2dcf0]
[gpu1:200795] [ 1] /lib64/libc.so.6(+0xceeb7)[0x15089c177eb7]
[gpu1:200795] [ 2] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libopen-pal.so.40(+0x4b230)[0x15089bb35230]
[gpu1:200795] [ 3] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(ompi_datatype_sndrcv+0x2c8)[0x15089d22d188]
[gpu1:200795] [ 4] /mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx/lib/libmpi.so.40(MPI_Allgather+0x143)[0x15089d22df33]
[gpu1:200795] [ 5] ./xhpl(+0x30cee)[0x562b5d3e1cee]
[gpu1:200795] [ 6] ./xhpl(+0xe0dd5)[0x562b5d491dd5]
[gpu1:200795] [ 7] ./xhpl(+0x1e71b)[0x562b5d3cf71b]
[gpu1:200795] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x15089c0e3d85]
[gpu1:200795] [ 9] ./xhpl(+0x1f70e)[0x562b5d3d070e]
[gpu1:200795] *** End of error message ***
./one_gpu.sh: line 16: 200795 Segmentation fault (core dumped) numactl --cpunodebind=0 --membind=0 ./xhpl
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23975,1],0]
Exit code: 139
What's more, I can do this successfully with HPC-Benchmark 21.4. The packages I use include:
nccl
nvshmem-2.9.0
openmpi-4.1.4 (built with UCX support; see the configure sketch after this list)
ucx-1.13.1
cuda-12.0
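For reference, Open MPI was configured against UCX roughly like this (a sketch; the install prefixes match the paths in the backtrace above, but the CUDA path is a placeholder):

./configure --prefix=/mnt/home/benchmark/local_apps/openmpi-4.1.4-ucx \
            --with-ucx=/mnt/home/benchmark/local_apps/ucx-1.13.1 \
            --with-cuda=/usr/local/cuda-12.0
make -j && make install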
Thanks!