Settings for HPL

I finally got the GPU version of HPL to run on my system, but the results are not what they should be. I followed the Howto - HPL on NVIDIA GPUs guide, but my numbers are nowhere near comparable. I'm using the following bash script to run HPL across two nodes, each with 24 processors and 4 GPUs:
export HPL_DIR=/net/user/erasmussen/hpl-2.0_FERMI_v13
export OMP_NUM_THREADS=6
export MKL_NUM_THREADS=6
export MKL_DYNAMIC=FALSE
export CUDA_DGEMM_SPLIT=0.836
export CUDA_DTRSM_SPLIT=0.806
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:/usr/local/cuda/lib64:/opt/intel/composer_xe_2011_sp1.7.256/mkl/lib/intel64:/opt/intel/composer_xe_2011_sp1.7.256/compiler/lib/intel64:$LD_LIBRARY_PATH
mpirun --machinefile machinefile --x LD_LIBRARY_PATH -np 8 xhpl

My machine file just lists each node I'm trying to run on four times. Is this how I'm supposed to run the program, or do I need to change something in my run script / mpirun command? Currently I'm getting about 500 GFLOPS total across the two nodes, when each M2070 GPU alone should be getting more than that.
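
For what it's worth, a quick way to sanity-check the rank placement (just swapping hostname in for xhpl in the same launch line) would be something like:

# each node's hostname should show a count of 4 if the ranks are spread evenly
mpirun --machinefile machinefile -np 8 hostname | sort | uniq -c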

What N and NB did you use?
How much memory is on the system and which interconnect?

There are 96 GB of memory on each node. I used this calculator to get an NB of 143360, N of 1, P of 2, and Q of 4. The system is using InfiniBand.

I assume you wanted to write N=143360 and NB=1.
In any case, your choice of NB is incorrect.
Please read the CUDA_LINPACK_README.txt file that explains how to choose the parameters:

NB should be a multiple of 128 (for best performance). It will also work with NB being a multiple of 64, but with lower performance. 768 typically gives the best results; larger values (e.g. 1024) may give better results if several GPUs share the same PCIe connection.

Try to replicate some of the output reported in the file.
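
As a rough sanity check on N (a sketch only, assuming you target roughly 85% of the combined 2 x 96 GB of host memory and round down to a multiple of NB):

# rough N sizing sketch: ~85% of total host memory, 8-byte doubles
MEM_BYTES=$((2 * 96 * 1000000000))   # combined memory of the two nodes
NB=768                               # block size recommended in the README
N=$(echo "sqrt(0.85 * $MEM_BYTES / 8)" | bc -l)
N=$(( ${N%.*} / NB * NB ))           # round down to a multiple of NB
echo "N = $N"                        # prints N = 142080 with these numbers

So your N=143360 is in the right ballpark; it is really the NB that needs fixing.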

I misread the NB field as the number of NBs rather than the actual NB value; I had it set to 512. Here is the HPL.dat file that I used:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL2.out         output file name (if any)
8                device out (6=stdout,7=stderr,file)
1                # of problems sizes (N)
143360           Ns
1                # of NBs
512              NBs
0                PMAP process mapping (0=Row-,1=Column-major)
1                # of process grids (P x Q)
2                Ps
4                Qs
16.0             threshold
1                # of panel fact
2                PFACTs (0=left, 1=Crout, 2=Right)
1                # of recursive stopping criterium
4                NBMINs (>= 1)
1                # of panels in recursion
2                NDIVs
1                # of recursive panel fact.
1                RFACTs (0=left, 1=Crout, 2=Right)
1                # of broadcast
1                BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1                # of lookahead depth
1                DEPTHs (>=0)
2                SWAP (0=bin-exch,1=long,2=mix)
64               swapping threshold
1                L1 in (0=transposed,1=no-transposed) form
1                U  in (0=transposed,1=no-transposed) form
1                Equilibration (0=no,1=yes)
8                memory alignment in double (> 0)
This line (no. 32) is ignored (it serves as a separator).
0                Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0                number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
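
As a rough check of my own on that N (back-of-the-envelope, assuming 8-byte doubles against the combined 2 x 96 GB of host memory):

# memory footprint of the N=143360 matrix vs. total host RAM
echo "scale=1; 143360^2 * 8 / 10^9" | bc                 # ~164.4 GB for the matrix
echo "scale=1; 143360^2 * 8 * 100 / (192 * 10^9)" | bc   # ~85.6% of the 192 GB total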

Something else that is weird: when I use nvidia-smi -q | grep Gpu, GPU utilization shows up as 99% for all 4 GPUs on one node but 0% on the other. Using some Dell management software, I can see that while HPL is running, only one GPU draws about 140 watts and the others stay below 90. By contrast, when I run the N-body benchmark, all 4 GPUs on both nodes draw about 200 watts and show 99% utilization. Could my machine file have anything to do with it? This is what I use:

cuda012
cuda012
cuda012
cuda012
cuda013
cuda013
cuda013
cuda013

Machinefile seems ok.
You could enable verbose print in the Makefile in src/cuda and rebuild the libdgemm library.
It will print out the GPU assignments and all the accelerated DGEMM/DTRSM calls.
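
As another quick check, you could launch one nvidia-smi per node through mpirun (reusing your machinefile) while the benchmark is running, something like:

# one nvidia-smi per node; the idle node should report Gpu : 0 % for all four devices
mpirun --machinefile machinefile -npernode 1 -np 2 nvidia-smi -q | grep Gpu

If that confirms the GPUs on the second node stay idle, the verbose DGEMM/DTRSM output will show whether those ranks ever get a device assigned.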
Which CUDA version are you running?

I am running CUDA 4.0