Poor results from CUDA Linpack on K80

I am running NVIDIA’s CUDA Linpack (hpl-2.0_FERMI_v15) on cloud VMs of various sizes containing Tesla K80s. However, I can never get above 50% efficiency (1.455 TFlops measured against a 2.91 TFlops peak). I have tried tuning, without much luck. Has anyone had success running CUDA HPL on a K80? I am using Intel MKL, OpenMPI, and CUDA V8.0.44.

I have read K80 benchmarking articles by Dell that give their configuration and results, and they report over 80% efficiency. However, they use a different version of NVIDIA Linpack (version 2.1), which I have not been able to find anywhere.
http://www.principledtechnologies.com/Dell/PowerEdge_C4130_NVIDIA_Tesla_K80_GPU_0315.pdf and http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/12/12/dell-poweredge-c4130-performance-with-k80-gpus-hpl

Any advice is welcome.

Results running on a single K80:

T/V                N    NB     P     Q               Time                 Gflops
WR03R2L2      114688  1024     1     2             691.11              1.455e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0042779 ...... PASSED
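
As a sanity check on the numbers above, the reported rate can be recomputed from N and the wall time. This is a minimal sketch: it assumes the standard HPL operation count of 2/3*N^3 + 3/2*N^2 and takes the 2.91 TFlops peak from the post rather than measuring it; the function names are illustrative only.

```python
# Sketch: recompute the HPL rate and efficiency for the run reported above.
# Assumption: the standard HPL operation count, 2/3*N^3 + 3/2*N^2.
# The peak value is the 2.91 TFlops figure quoted in the post, not a measurement.

def hpl_gflops(n: int, seconds: float) -> float:
    """Return the HPL rate in Gflops for problem size n and wall time in seconds."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9

def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Return the measured rate as a fraction of the peak rate."""
    return measured_gflops / peak_gflops

N = 114688            # Ns from HPL.dat
TIME_S = 691.11       # Time column of the result line
PEAK_GFLOPS = 2910.0  # peak quoted in the post

rate = hpl_gflops(N, TIME_S)
eff = efficiency(rate, PEAK_GFLOPS)
print(f"{rate:.1f} Gflops, {eff:.1%} of the quoted peak")
```

The recomputed rate agrees with the Gflops column, so the 50% figure comes from the run itself and not from a reporting error.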

Relevant configuration:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
114688       Ns
1            # of NBs
1024         NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
2            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
32           memory alignment in double (> 0)
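
For reference, the memory implied by this configuration can be worked out from N, NB, P and Q. The sketch below assumes only that the matrix is stored as 8-byte doubles and split roughly evenly across the P x Q processes; it ignores workspace and makes no claim about where this CUDA build keeps the matrix. The helper name `hpl_layout` is hypothetical.

```python
# Sketch: size of the HPL matrix for the HPL.dat values shown above.
# Assumptions: 8-byte doubles, an approximately even split across P*Q
# processes, and no allowance for panels or other workspace.

def hpl_layout(n: int, nb: int, p: int, q: int) -> dict:
    matrix_bytes = 8 * n * n
    return {
        "matrix_gib": matrix_bytes / 2**30,
        "gib_per_process": matrix_bytes / (p * q) / 2**30,
        "block_columns": n // nb,
        "n_multiple_of_nb": n % nb == 0,
    }

layout = hpl_layout(n=114688, nb=1024, p=1, q=2)
for key, value in layout.items():
    print(f"{key}: {value}")
```

Comparing `matrix_gib` against the RAM available on each VM size is a quick way to tell whether N is appropriately sized for that instance.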

run_linpack relevant config:


export CUDA_DGEMM_SPLIT=.999
export CUDA_DTRSM_SPLIT=.999

To run:

mpirun -np 2 ./run_linpack

I know nothing about the Linpack code or its calling harness, but I did notice that you’re on a K80, which is not a single GPU but two GPUs on one board, and that your efficiency is under 50%. Is it possible that the code is running on only one GPU rather than both? That would be especially likely if you’re reaching around 40%, half of the 80% Dell reports.

Yes, the K80 is two GPUs, and I did in fact run on both, hence the value of 2 for Q in the HPL config. I also confirmed this using nvidia-smi.
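
One way to make the nvidia-smi check repeatable is to poll it while HPL is running and flag any GPU that looks idle. This is only a sketch: the function names and the 10% utilization threshold are arbitrary assumptions, and it assumes every queried field comes back as a plain integer, which may not hold on all driver and VM combinations.

```python
# Sketch: flag GPUs that appear idle while HPL is running.
# Assumptions: nvidia-smi is on PATH, every field is a plain integer, and
# 10% utilization is an arbitrary cutoff chosen for illustration.
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_rows(text: str) -> list:
    """Parse 'index, util, mem' CSV lines into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return rows

def idle_gpus(rows: list, min_util_pct: int = 10) -> list:
    """Return the indices of GPUs whose utilization is below the cutoff."""
    return [row["index"] for row in rows if row["util_pct"] < min_util_pct]

def query_gpus() -> list:
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)
```

Calling `idle_gpus(query_gpus())` a few times during the factorization should return an empty list if both halves of the K80 are busy.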

Yes, the best HPL performance will come from HPL code that NVIDIA provides on a case-by-case basis. It is not publicly available, and hpl-2.0_FERMI_v15 will not achieve the highest performance on GPUs newer than Fermi.