I am running the NVIDIA HPL benchmark (version 23.10) on a V100.
This is the nvidia-smi output inside the Docker container.
I run it with the following command:
mpirun -np 1 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1 ./hpl.sh --dat HPL-1GPU.dat --no-multinode --cuda-compat
But I get the following error:
================================================================================
HPL-NVIDIA 23.10.0 -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 92800
NB : 1024
PMAP : Column-major process mapping
P : 1
Q : 1
PFACT : Left
NBMIN : 2
NDIV : 2
RFACT : Left
BCAST : 2ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
HPL-NVIDIA ignores the following parameters from input file:
* Broadcast parameters
* Panel factorization parameters
* Look-ahead value
* L1 layout
* U layout
* Equilibration parameter
* Memory alignment parameter
HPL-NVIDIA settings from environment variables:
monitor_gpu from environment variable 0
warmup_end_prog from environment variable 5.0
test_loops from environment variable 1
hpl_cfg_cuda_vmm from environment variable 0
Device info:
Peak clock frequency 1380 MHz
SM 70
Number of SMs 80
Total memory available 31.74 GB
canUseHostPointerForRegisteredMem 1
canMapHostMemory 1
[HPL TRACE] cuda_nvshmem_init: max=0.4351 (0) min=0.4351 (0)
[WARNING] Change Input N 92800 to 92160
[HPL TRACE] ncclCommInitRank: max=0.1208 (0) min=0.1208 (0)
[cfe5c217c9f7:133 :0:133] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 133) ====
0 0x0000000000042520 __sigaction() ???:0
=================================
[cfe5c217c9f7:00133] *** Process received signal ***
[cfe5c217c9f7:00133] Signal: Segmentation fault (11)
[cfe5c217c9f7:00133] Signal code: (-6)
[cfe5c217c9f7:00133] Failing at address: 0x85
[cfe5c217c9f7:00133] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4da9956520]
[cfe5c217c9f7:00133] *** End of error message ***
./hpl.sh: line 254: 133 Segmentation fault (core dumped) ${NUMCMD} ${CPUBIND} ${MEMBIND} ${XHPL} ${DAT}
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44010,1],0]
Exit code: 139
--------------------------------------------------------------------------
The error seems to occur in cugetrfs_mp_init.
Can someone help us?
