4X Tesla K20m problem with linpack

Hello!
I have 4 Tesla K20m GPUs in a server.

hpl-2.0_FERMI_v15 results

T/V                N    NB     P     Q               Time             Gflops
WR10L2L2      100000  1152     1     4             240.26          2.775e+03

I think the performance should be higher.

Please help me choose optimal settings in HPL.dat and run_linpack.

my HPL.dat

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
1            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
100000         Ns
1             # of NBs
1152         NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1          Ps
4          Qs             #  (2 2 2 4        Qs. for the dual GPU run)
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0 2          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

run_linpack

#!/bin/bash

#location of HPL
export HPL_DIR=/home/teslauser/hpl

CPU_CORES_PER_GPU=12

# FOR MKL
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR GOTO
export GOTO_NUM_THREADS=$CPU_CORES_PER_GPU
# FOR OMP
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU

# hint: for 2050 or 2070 card
#       try 350/(350 + MKL_NUM_THREADS*4*cpu frequency in GHz)
export CUDA_DGEMM_SPLIT=0.85

# hint: try CUDA_DGEMM_SPLIT - 0.10
export CUDA_DTRSM_SPLIT=0.75

export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH:/opt/intel/composer_xe_2015.3.187/mkl/lib/intel64/:/usr/local/cuda-7.5/lib64:/opt/intel/composer_xe_2015.3.187/compiler/lib/intel64/:/home/teslauser/.openmpi-1_10_2/lib

~/.openmpi-1_10_2/bin/mpirun -n 4 ./xhpl
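For reference, here is what the Fermi rule of thumb quoted in the script gives for this host CPU (a sketch only: the 350 GFLOPS figure comes from the script's 2050/2070 hint, and a K20m is faster, so its best split is likely higher than this):

```python
# The script's hint: split = 350 / (350 + MKL_NUM_THREADS * 4 * GHz).
# 350 approximates a Fermi-class GPU's DGEMM GFLOPS; a K20m should be
# faster, so treat the result as a lower bound, not a recommendation.
gpu_dgemm = 350.0     # Fermi DGEMM rate assumed by the hint, GFLOPS
threads = 12          # CPU_CORES_PER_GPU from the script
ghz = 2.4             # E5-2695 v2 base clock
split = gpu_dgemm / (gpu_dgemm + threads * 4 * ghz)
print(round(split, 2))   # ~0.75 for this host CPU
```

In practice the best values for CUDA_DGEMM_SPLIT and CUDA_DTRSM_SPLIT are found by sweeping a small range around an estimate like this.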

I would be very grateful for any help!

This is outside my area of expertise, but as best I know, the HPL (Linpack) version provided by NVIDIA splits the work between CPU and GPU. For the most useful configuration suggestions, you might want to give more details about your host system (e.g. vendor, number of sockets, CPU type, amount of system memory).

Here is the configuration:
- 4x Tesla K20m
- memory: 192 GB
- nproc: 48
- CPU: Intel Xeon E5-2695 v2 @ 2.40GHz
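As a rough sanity check on whether 2.775 TFLOPS is low for this machine (a sketch; the per-device peak figures below are my assumptions, not measured numbers):

```python
# Rough efficiency estimate -- a sketch, not a definitive number.
# Assumptions: ~1.17 TFLOPS DP peak per K20m (13 SMX * 64 DP cores
# * 706 MHz * 2 flops), and Ivy Bridge AVX peak of 8 DP flops per
# cycle per core (no FMA on E5 v2).
gpu_peak = 4 * 1.17e3                 # 4x K20m, GFLOPS
cpu_peak = 2 * 12 * 2.4 * 8           # 2 sockets * 12 cores * 2.4 GHz * 8
total_peak = gpu_peak + cpu_peak      # ~5141 GFLOPS combined
measured = 2.775e3                    # reported HPL result, GFLOPS
eff = measured / total_peak
print(f"peak ~{total_peak:.0f} GFLOPS, efficiency ~{eff:.0%}")
```

Around half of theoretical peak, which does suggest there is tuning headroom.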

Can anyone help?
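One observation from the numbers above: with N=100000 the matrix occupies only about 80 GB of the 192 GB available, and HPL generally improves with larger problem sizes. A sketch of the usual sizing rule (the 80% headroom fraction is an assumption, left for the OS, MPI buffers, and the CUDA runtime):

```python
import math

# Usual HPL sizing rule: choose N so the N*N matrix of doubles fills
# most of host RAM, then round down to a multiple of NB.
mem_bytes = 192 * 1024**3              # 192 GB system memory
frac = 0.80                            # headroom fraction (assumption)
nb = 1152                              # NB from the HPL.dat above
n_raw = math.isqrt(int(frac * mem_bytes) // 8)
n = (n_raw // nb) * nb                 # multiple of NB, well above 100000
print(n)
```

Whether a larger N helps here depends on how the GPU offload handles the bigger panels, so it is worth testing a couple of sizes.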

What are your settings for CUDA_DGEMM_SPLIT and CUDA_DTRSM_SPLIT?
How many CPU hardware threads are you using per GPU?
What snoop mode are you using on your system?