I finally got the GPU version of HPL to run on my system, but the performance is far below what I expected. I followed the "Howto - HPL on NVIDIA GPUs" guide, but my results are nowhere near comparable. I’m using the following bash script to run HPL across 2 nodes, each with 24 processors and 4 GPUs:
export HPL_DIR=/net/user/erasmussen/hpl-2.0_FERMI_v13
export OMP_NUM_THREADS=6
export MKL_NUM_THREADS=6
export MKL_DYNAMIC=FALSE
export CUDA_DGEMM_SPLIT=0.836
export CUDA_DTRSM_SPLIT=0.806
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:/usr/local/cuda/lib64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:$LD_LIBRARY_PATH
mpirun --machinefile machinefile --x LD_LIBRARY_PATH -np 8 xhpl
My machinefile simply lists each of the two nodes 4 times. Is this how I’m supposed to run the program, or do I need to change something in my run script or mpirun command? Currently I’m getting about 500 GFLOPS in total across the two nodes, when each M2070 GPU should be getting more than that alone.
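To make the layout concrete, here is a rough sketch of how a machinefile like the one I describe could be generated. The hostnames node01 and node02 are placeholders rather than my real node names, and the one-hostname-per-line format is an assumption about what mpirun expects here. The script also checks that the rank and thread counts line up with the hardware (4 entries per node times 6 threads each against 24 processors per node).

```shell
#!/bin/bash
# Sketch only: node01/node02 are placeholder hostnames (assumptions).
nodes="node01 node02"
ranks_per_node=4     # 4 machinefile entries per node, matching 4 GPUs per node
threads_per_rank=6   # same value as OMP_NUM_THREADS / MKL_NUM_THREADS above

# Write each hostname once per rank, one hostname per line.
: > machinefile
for node in $nodes; do
  for i in $(seq "$ranks_per_node"); do
    echo "$node" >> machinefile
  done
done

# Total ranks should equal the -np value passed to mpirun.
total_ranks=$(wc -l < machinefile | tr -d ' ')
# Threads per node should not exceed the processors available on each node.
threads_per_node=$((ranks_per_node * threads_per_rank))

echo "total ranks: $total_ranks"
echo "threads per node: $threads_per_node"
```

With the values above, the file ends up with 8 lines, which matches the -np 8 in my mpirun command, and the thread count per node matches the 24 processors on each node.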