HPL on Kepler GPUs

I’m trying to run benchmarks on a couple of machines that have GTX 690 cards installed. When I run a benchmark, it does not appear to touch the GPUs, and the results I get are identical to those from machines with no GPUs.

I suspect this may be because the GTX 690 cards are Kepler rather than Fermi, but if that is the case, is there any way to run HPL using the GPUs I have?

Thank you!

This is the hardware I’m running on:
Intel Core i7-3930K CPU
32 GB RAM
4 x GTX 690 cards (each card has two GPUs, so eight GPUs total)

I’m running on Ubuntu 12.04 LTS, 64-bit install

I installed these packages:
CUDA 4.2 Toolkit
CUDA 5.0 driver
openmpi1.5-bin, v1.5.4 (I have a couple of the machines described above)
Intel Compiler 13.0 Update 1
Intel MKL 11.0 Update 1
hpl-2.0_FERMI_v15

This is my run_linpack file:

#!/bin/bash
export HPL_DIR=/cluster/setup/hpl-2.0_FERMI_v15
export OMP_NUM_THREADS=2       #number of cpu cores per process
export MKL_NUM_THREADS=1       #number of cpu cores per GPU used
export MKL_DYNAMIC=TRUE
export CUDA_DGEMM_SPLIT=0.836  #how much work to offload to GPU for DGEMM
export CUDA_DTRSM_SPLIT=0.806    #how much work to offload to GPU for DTRSM
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
$HPL_DIR/bin/CM01/xhpl
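
One way to see whether the GPUs get touched at all is to poll nvidia-smi from a second terminal while xhpl runs. GPU-Util reports N/A on these GeForce cards, so the only real hint is the per-GPU memory usage, which should climb well above the idle ~7 MB once work is offloaded (a simple check, nothing specific to this HPL build):

# refresh nvidia-smi once per second while the benchmark runs;
# watch the Memory-Usage column, since GPU-Util shows N/A here
watch -n 1 nvidia-smi
# or, without watch:
nvidia-smi -l 1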

My HPL.dat file:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
41216         Ns
1            # of NBs
1024           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
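
For scale: N = 41216 at 8 bytes per element works out to roughly 41216^2 x 8 ≈ 13.6 GB, a bit under half of the 32 GB of host RAM. The usual rule of thumb is to size N at around 80–90% of host memory and round it to a multiple of NB; a quick back-of-the-envelope check (just the rule of thumb, nothing this HPL version requires):

# memory footprint of the current N, in GB
awk 'BEGIN { n = 41216; printf "current N uses %.1f GB\n", n*n*8/1e9 }'
# rule-of-thumb N for 32 GB at ~85% usage (then round to a multiple of NB)
awk 'BEGIN { mem = 32e9; printf "suggested N ~ %d\n", int(sqrt(0.85*mem/8)) }'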

This is my Make.CUDA file:

# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = CUDA
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
# Set TOPdir to the location of where this is being built
ifndef  TOPdir
TOPdir = /cluster/setup/hpl-2.0_FERMI_v15
endif
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/lib/openmpi
MPinc        = -I$(MPdir)/include
MPlib        = /usr/lib/libmpi.a
MPlib        = /usr/lib/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /opt/intel/composer_xe_2013.1.117/mkl/lib/intel64
LAinc        = -I/usr/local/cuda/include
# CUDA
LAlib        = -L/cluster/setup/hpl-2.0_FERMI_v15/src/cuda  -ldgemm -L/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda/include
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_DETAILED_TIMING  enable detailed timers;
# -DASYOUGO              enable timing information as you go (nonintrusive)
# -DASYOUGO2             slightly intrusive timing information
# -DASYOUGO2_DISPLAY     display detailed DGEMM information
# -DENDEARLY             end the problem early
# -DFASTSWAP             insert to use DLASWP instead of HPL code
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     =  -DCUDA
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
# next two lines for GNU Compilers:
CC      = mpicc
CCFLAGS = $(HPL_DEFS) -O3 -fomit-frame-pointer -funroll-loops -W -Wall -fopenmp
# next two lines for Intel Compilers:
# CC      = mpicc
#CCFLAGS = $(HPL_DEFS) -O3 -axS -w -fomit-frame-pointer -funroll-loops -openmp
#
CCNOOPT      = $(HPL_DEFS) -O0 -w
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = $(CC)
#LINKFLAGS    = $(CCFLAGS) -static_mpi
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------
MAKE = make TOPdir=$(TOPdir)
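
(Note: MPlib is assigned twice above; with make the second assignment wins, so the build links against /usr/lib/libmpich.a rather than the OpenMPI library on the line before it, which matches the libmpich.so.3 entry in the ldd output further down. A quick way to confirm what the finished binary actually pulled in, run from the bin/CM01 directory:)

# list the MPI and BLAS-related libraries xhpl resolved at link time
ldd ./xhpl | grep -Ei 'mpi|mkl|blas|dgemm'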

The GPUs seem to show up:

Fri Jan 18 17:21:49 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 690          | 0000:07:00.0     N/A |                  N/A |
| 30%   39C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 690          | 0000:08:00.0     N/A |                  N/A |
| 30%   36C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 690          | 0000:03:00.0     N/A |                  N/A |
| 30%   42C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 690          | 0000:04:00.0     N/A |                  N/A |
| 30%   42C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 690          | 0000:0B:00.0     N/A |                  N/A |
| 30%   37C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 690          | 0000:0C:00.0     N/A |                  N/A |
| 30%   37C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 690          | 0000:0F:00.0     N/A |                  N/A |
| 30%   30C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 690          | 0000:10:00.0     N/A |                  N/A |
| 30%   30C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
|    4            Not Supported                                               |
|    5            Not Supported                                               |
|    6            Not Supported                                               |
|    7            Not Supported                                               |
+-----------------------------------------------------------------------------+

This is the ldd output for the xhpl binary:

ldd xhpl
        linux-vdso.so.1 =>  (0x00007fffb00e6000)
        libdgemm.so.1 => /cluster/lib/libdgemm.so.1 (0x00007f00c1d0a000)
        libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007f00c1aac000)
        libcublas.so.4 => /usr/local/cuda/lib64/libcublas.so.4 (0x00007f00bb0ad000)
        libmkl_intel_lp64.so => /usr/local/cuda/lib64/libmkl_intel_lp64.so (0x00007f00ba8c7000)
        libmkl_intel_thread.so => /usr/local/cuda/lib64/libmkl_intel_thread.so (0x00007f00b9848000)
        libmkl_core.so => /usr/local/cuda/lib64/libmkl_core.so (0x00007f00b87c8000)
        libiomp5.so => /cluster/lib/libiomp5.so (0x00007f00b84c6000)
        libmpl.so.1 => /usr/lib/libmpl.so.1 (0x00007f00b82c1000)
        libcr.so.0 => /usr/lib/libcr.so.0 (0x00007f00b80b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f00b7e99000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f00b7ada000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f00b6ef2000)
        libmpich.so.3 => /usr/lib/libmpich.so.3 (0x00007f00b6b15000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f00b6911000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f00b6708000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f00b6408000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f00b610c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f00b5ef5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f00c1f27000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f00b5cde000)

In src/cuda/Makefile, enable the verbose print and disable the Fermi-specific kernels.
Performance of the 690s is going to be very low.
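
For example, something along these lines to find the relevant settings (the exact variable names in src/cuda/Makefile differ between releases, so treat this as a sketch of where to look rather than the literal edit):

cd /cluster/setup/hpl-2.0_FERMI_v15/src/cuda
# locate the verbose-print and Fermi-specific kernel options
grep -n -iE 'verbose|fermi' Makefile
# after editing, rebuild libdgemm.so here and relink xhpl from the top-level build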

Hi,

I’m trying to run HPL compiled using CUDA.

What I have on the system:

  1. Nvidia driver (TESLA DRIVER FOR LINUX RHEL 7, Version: 390.30)
  2. CUDA installation toolkit (cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64.rpm)
  3. Mpich (mpich-3.2.1)
  4. openBLAS (OpenBLAS-0.2.20)
  5. HPL provided by nvidia (hpl-2.0_FERMI_v15_latest)

When I run the code, it complains about a missing libmkl_intel_lp64.so, but when I do ldd on xhpl it shows no dependency on libmkl_intel_lp64.so. Moreover, I used OpenBLAS to compile HPL, so I am not sure why it is asking for an Intel MKL library!

Any suggestions?

What I run:
/root/hpl/mpich/bin/mpirun -np 1 -hostfile nodes ./run_linpack

Error:
libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

Dependency tree:
ldd xhpl
linux-vdso.so.1 => (0x00007ffeb597c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd1e10f4000)
libdgemm.so.1 => /root/hpl/hpl-2.0_FERMI_v15_latest/src/cuda/libdgemm.so.1 (0x00007fd1e0eeb000)
libcublas.so.9.1 => /usr/local/cuda/lib64/libcublas.so.9.1 (0x00007fd1dd954000)
libcuda.so.1 => /usr/lib64/nvidia/libcuda.so.1 (0x00007fd1dcdb4000)
libcudart.so.9.1 => /usr/local/cuda/lib64/libcudart.so.9.1 (0x00007fd1dcb45000)
libopenblas.so.0 => /root/hpl/openblas/lib/libopenblas.so.0 (0x00007fd1dbbb6000)
libmpi.so.12 => /root/hpl/mpich/lib/libmpi.so.12 (0x00007fd1db737000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fd1db510000)
libc.so.6 => /lib64/libc.so.6 (0x00007fd1db14d000)
/lib64/ld-linux-x86-64.so.2 (0x00005633fe7d7000)
librt.so.1 => /lib64/librt.so.1 (0x00007fd1daf45000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fd1dad40000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fd1daa38000)
libm.so.6 => /lib64/libm.so.6 (0x00007fd1da736000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd1da51f000)
libnvidia-fatbinaryloader.so.390.30 => /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.390.30 (0x00007fd1da2d3000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007fd1d9fb1000)
libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007fd1d9d74000)

I can’t see libmkl_intel_lp64.so mentioned anywhere there, but every time I run HPL it still complains about the missing libmkl_intel_lp64.so.

Thanks,
Karan

My guess would be that your Make.CUDA file is calling out a link dependency on MKL.

For example, the libdgemm.so that gets built may depend on MKL.

Try lddtree instead of ldd.

Note that lddtree is not part of a standard Linux install; you’ll need to track down and install it yourself if you want to use it. Another similar tool is tldd.
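
For what it’s worth, on many distributions lddtree ships in the pax-utils package, so installing it is usually something like (assuming your package manager and repos carry it):

yum install pax-utils        # RHEL/CentOS (may need EPEL)
apt-get install pax-utils    # Debian/Ubuntu
lddtree ./xhpl               # then walk the full dependency tree of the binary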

However, if you don’t want to install something, you can just run ldd on the dependent libraries, such as libdgemm.so.
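
For example, using the libdgemm.so.1 path from your ldd output (adjust if yours lives elsewhere):

# see whether the prebuilt CUDA DGEMM wrapper itself wants MKL
ldd /root/hpl/hpl-2.0_FERMI_v15_latest/src/cuda/libdgemm.so.1 | grep -i mkl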

But it’s probably even simpler to just carefully inspect the Make.CUDA file.
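
e.g. a quick grep in the top-level HPL directory (the path is wherever you unpacked it):

# any mkl reference in the build configuration is a candidate culprit
grep -n -i mkl Make.CUDA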

Having said all that, I think you’re unlikely to get interesting or useful results using that very old version of HPL on any modern GPU (e.g., Maxwell, Pascal, or Volta); it was designed with Fermi in mind.

If you google around on these forums for other references to HPL, you’ll find more discussion of this; I’m not going to repeat it all here.