Hello!
I have 4 Tesla K20m GPUs in one server.
Results with hpl-2.0_FERMI_v15:
T/V            N      NB    P  Q  Time    Gflops
WR10L2L2       100000 1152  1  4  240.26  2.775e+03
I think the performance should be higher.
Please help me choose the optimal settings for HPL.dat and run_linpack.
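To put the result in context, here is a rough efficiency estimate. It is only a sketch: the 1170 Gflops double-precision peak per K20m is an assumption taken from the published spec as I understand it, the CPU contribution is ignored, and `hpl_efficiency` is just a name made up for illustration.

```shell
#!/bin/bash
# Rough HPL efficiency in percent:
#   100 * measured Gflops / (number of GPUs * peak Gflops per GPU)
# ASSUMPTION: 1170 Gflops double-precision peak per K20m; CPU flops ignored.
hpl_efficiency() {
    local measured=$1 gpus=$2 peak_per_gpu=$3
    awk -v m="$measured" -v g="$gpus" -v p="$peak_per_gpu" \
        'BEGIN { printf "%.1f\n", 100 * m / (g * p) }'
}

# Measured value from the run above, 4 GPUs, assumed peak per GPU.
hpl_efficiency 2.775e+03 4 1170
```

Under those assumptions the run reaches a bit under 60% of the GPU peak, which is why I suspect there is room for tuning.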
My HPL.dat:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
1 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
100000 Ns
1 # of NBs
1152 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
4 Qs # (2 2 2 4 Qs. for the dual GPU run)
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
1 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
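A common rule of thumb for picking Ns is to let the N x N matrix of 8-byte doubles fill most of the available memory and to round N down to a multiple of NB. The sketch below does only that arithmetic. The 64 GiB and the 0.8 fraction are placeholder values, not this server's real figures, and `suggest_n` is a hypothetical helper name.

```shell
#!/bin/bash
# Suggest a problem size N: the largest multiple of NB whose N x N matrix
# of 8-byte doubles fits in a given fraction of memory.
# ASSUMPTION: mem_gib and fraction are placeholders; use the real host values.
suggest_n() {
    local mem_gib=$1 fraction=$2 nb=$3
    awk -v g="$mem_gib" -v f="$fraction" -v nb="$nb" 'BEGIN {
        bytes = g * 1024 * 1024 * 1024 * f   # memory budget in bytes
        n = int(sqrt(bytes / 8))             # 8 bytes per double
        print int(n / nb) * nb               # round down to a multiple of NB
    }'
}

# Placeholder: 64 GiB of memory, 80% of it used for the matrix, NB = 1152.
suggest_n 64 0.8 1152
```

For comparison, the current N = 100000 needs 100000 x 100000 x 8 bytes = 80 GB for the matrix alone, and 100000 is not a multiple of NB = 1152.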
My run_linpack:
#!/bin/bash
#location of HPL
export HPL_DIR=/home/teslauser/hpl
CPU_CORES_PER_GPU=12
# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU
# hint: for 2050 or 2070 card
# try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz)
export CUDA_DGEMM_SPLIT=0.85
# hint: try CUDA_DGEMM_SPLIT - 0.10
export CUDA_DTRSM_SPLIT=0.75
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH:/opt/intel/composer_xe_2015.3.187/mkl/lib/intel64/:/usr/local/cuda-7.5/lib64:/opt/intel/composer_xe_2015.3.187/compiler/lib/intel64/:/home/teslauser/.openmpi-1_10_2/lib
~/.openmpi-1_10_2/bin/mpirun -n 4 ./xhpl
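The hint in the script's comments can be turned into a quick calculation. This is only a sketch of that formula as written: the 350 is the figure the comment gives for 2050/2070 cards, so the right value for a K20m is an open assumption, the 2.6 GHz clock is a placeholder rather than this server's real CPU frequency, and `dgemm_split` is a made-up helper name.

```shell
#!/bin/bash
# Split estimate from the hint in run_linpack:
#   gpu / (gpu + threads * 4 * cpu_ghz)
# ASSUMPTION: gpu=350 is the value quoted for 2050/2070 cards, not a K20m;
# cpu_ghz=2.6 is a placeholder clock frequency.
dgemm_split() {
    local gpu=$1 threads=$2 cpu_ghz=$3
    awk -v g="$gpu" -v t="$threads" -v f="$cpu_ghz" \
        'BEGIN { printf "%.2f\n", g / (g + t * 4 * f) }'
}

split=$(dgemm_split 350 12 2.6)
echo "CUDA_DGEMM_SPLIT=$split"
# Second hint from the script: CUDA_DTRSM_SPLIT = CUDA_DGEMM_SPLIT - 0.10
echo "CUDA_DTRSM_SPLIT=$(awk -v s="$split" 'BEGIN { printf "%.2f\n", s - 0.10 }')"
```

A faster GPU pushes the result toward 1.0, so the values I have now (0.85 and 0.75) may not be right for a K20m either. Note also that with `mpirun -n 4` and CPU_CORES_PER_GPU=12 the script asks for 48 threads in total.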
I would be very grateful for any help!