HPL on Kepler GPUs

I’m trying to run benchmarks on a couple of machines that have GTX 690 cards installed. When I run a benchmark, it does not appear to touch the GPUs, and the results I get are identical to those from machines with no GPUs.

I suspect this may be because the GTX 690 cards are Kepler rather than Fermi, but if that is the case, is there any way to run HPL using the GPUs I have?

Thank you!

This is the hardware I’m running on:
Intel Core i7-3930K CPU
32 GB RAM
4 x GTX 690 cards (each card has two GPUs, so eight GPUs total)

I’m running on Ubuntu 12.04 LTS, 64-bit install

I installed these packages:
CUDA 4.2 Toolkit
CUDA 5.0 driver
openmpi1.5-bin, v1.5.4 (I have a couple of the machines described above)
Intel Compiler 13.0 Update 1
Intel MKL 11.0 Update 1
hpl-2.0_FERMI_v15

This is my run_linpack file:

#!/bin/bash
export HPL_DIR=/cluster/setup/hpl-2.0_FERMI_v15
export OMP_NUM_THREADS=2       #number of cpu cores per process
export MKL_NUM_THREADS=1       #number of cpu cores per GPU used
export MKL_DYNAMIC=TRUE
export CUDA_DGEMM_SPLIT=0.836  #how much work to offload to GPU for DGEMM
export CUDA_DTRSM_SPLIT=0.806    #how much work to offload to GPU for DTRSM
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
$HPL_DIR/bin/CM01/xhpl
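
One way to see whether the GPUs get touched at all is to poll nvidia-smi from a second terminal while xhpl runs. GPU-Util reports N/A on these GeForce cards, so the only real hint is the per-GPU memory usage, which should climb well above the idle ~7 MB once work is offloaded (a simple check, nothing specific to this HPL build):

# refresh nvidia-smi once per second while the benchmark runs;
# watch the Memory-Usage column, since GPU-Util shows N/A here
watch -n 1 nvidia-smi
# or, without watch:
nvidia-smi -l 1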

My HPL.dat file:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
41216         Ns
1            # of NBs
1024           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
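
For scale: N = 41216 at 8 bytes per element works out to roughly 41216^2 x 8 ≈ 13.6 GB, a bit under half of the 32 GB of host RAM. The usual rule of thumb is to size N at around 80–90% of host memory and round it to a multiple of NB; a quick back-of-the-envelope check (just the rule of thumb, nothing this HPL version requires):

# memory footprint of the current N, in GB
awk 'BEGIN { n = 41216; printf "current N uses %.1f GB\n", n*n*8/1e9 }'
# rule-of-thumb N for 32 GB at ~85% usage (then round to a multiple of NB)
awk 'BEGIN { mem = 32e9; printf "suggested N ~ %d\n", int(sqrt(0.85*mem/8)) }'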

This is my Make.CUDA file:

# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = CUDA
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
# Set TOPdir to the location of where this is being built
ifndef  TOPdir
TOPdir = /cluster/setup/hpl-2.0_FERMI_v15
endif
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/lib/openmpi
MPinc        = -I$(MPdir)/include
MPlib        = /usr/lib/libmpi.a
MPlib        = /usr/lib/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /opt/intel/composer_xe_2013.1.117/mkl/lib/intel64
LAinc        = -I/usr/local/cuda/include
# CUDA
LAlib        = -L/cluster/setup/hpl-2.0_FERMI_v15/src/cuda  -ldgemm -L/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda/include
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_DETAILED_TIMING  enable detailed timers;
# -DASYOUGO              enable timing information as you go (nonintrusive)
# -DASYOUGO2             slightly intrusive timing information
# -DASYOUGO2_DISPLAY     display detailed DGEMM information
# -DENDEARLY             end the problem early
# -DFASTSWAP             insert to use DLASWP instead of HPL code
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     =  -DCUDA
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
# next two lines for GNU Compilers:
CC      = mpicc
CCFLAGS = $(HPL_DEFS) -O3 -fomit-frame-pointer -funroll-loops -W -Wall -fopenmp
# next two lines for Intel Compilers:
# CC      = mpicc
#CCFLAGS = $(HPL_DEFS) -O3 -axS -w -fomit-frame-pointer -funroll-loops -openmp
#
CCNOOPT      = $(HPL_DEFS) -O0 -w
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = $(CC)
#LINKFLAGS    = $(CCFLAGS) -static_mpi
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------
MAKE = make TOPdir=$(TOPdir)
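
(Note: MPlib is assigned twice above; with make the second assignment wins, so the build links against /usr/lib/libmpich.a rather than the OpenMPI library on the line before it, which matches the libmpich.so.3 entry in the ldd output further down. A quick way to confirm what the finished binary actually pulled in, run from the bin/CM01 directory:)

# list the MPI and BLAS-related libraries xhpl resolved at link time
ldd ./xhpl | grep -Ei 'mpi|mkl|blas|dgemm'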

The GPUs seem to show up:

Fri Jan 18 17:21:49 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 690          | 0000:07:00.0     N/A |                  N/A |
| 30%   39C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 690          | 0000:08:00.0     N/A |                  N/A |
| 30%   36C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 690          | 0000:03:00.0     N/A |                  N/A |
| 30%   42C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 690          | 0000:04:00.0     N/A |                  N/A |
| 30%   42C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 690          | 0000:0B:00.0     N/A |                  N/A |
| 30%   37C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 690          | 0000:0C:00.0     N/A |                  N/A |
| 30%   37C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 690          | 0000:0F:00.0     N/A |                  N/A |
| 30%   30C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 690          | 0000:10:00.0     N/A |                  N/A |
| 30%   30C  N/A     N/A /  N/A |   0%    7MB / 2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
|    2            Not Supported                                               |
|    3            Not Supported                                               |
|    4            Not Supported                                               |
|    5            Not Supported                                               |
|    6            Not Supported                                               |
|    7            Not Supported                                               |
+-----------------------------------------------------------------------------+

This is the ldd output for the xhpl binary:

ldd xhpl
        linux-vdso.so.1 =>  (0x00007fffb00e6000)
        libdgemm.so.1 => /cluster/lib/libdgemm.so.1 (0x00007f00c1d0a000)
        libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007f00c1aac000)
        libcublas.so.4 => /usr/local/cuda/lib64/libcublas.so.4 (0x00007f00bb0ad000)
        libmkl_intel_lp64.so => /usr/local/cuda/lib64/libmkl_intel_lp64.so (0x00007f00ba8c7000)
        libmkl_intel_thread.so => /usr/local/cuda/lib64/libmkl_intel_thread.so (0x00007f00b9848000)
        libmkl_core.so => /usr/local/cuda/lib64/libmkl_core.so (0x00007f00b87c8000)
        libiomp5.so => /cluster/lib/libiomp5.so (0x00007f00b84c6000)
        libmpl.so.1 => /usr/lib/libmpl.so.1 (0x00007f00b82c1000)
        libcr.so.0 => /usr/lib/libcr.so.0 (0x00007f00b80b6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f00b7e99000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f00b7ada000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f00b6ef2000)
        libmpich.so.3 => /usr/lib/libmpich.so.3 (0x00007f00b6b15000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f00b6911000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f00b6708000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f00b6408000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f00b610c000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f00b5ef5000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f00c1f27000)
        libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f00b5cde000)

In src/cuda/Makefile, enable the verbose print and disable the Fermi-specific kernels.
Performance of the 690s is going to be very low.
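
For example, something along these lines to find the relevant settings (the exact variable names in src/cuda/Makefile differ between releases, so treat this as a sketch of where to look rather than the literal edit):

cd /cluster/setup/hpl-2.0_FERMI_v15/src/cuda
# locate the verbose-print and Fermi-specific kernel options
grep -n -iE 'verbose|fermi' Makefile
# after editing, rebuild libdgemm.so here and relink xhpl from the top-level build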

Hi,

I’m trying to run HPL compiled using CUDA.

What I have on the system:

  1. Nvidia driver (TESLA DRIVER FOR LINUX RHEL 7, Version: 390.30)
  2. CUDA installation toolkit (cuda-repo-rhel7-9-1-local-9.1.85-1.x86_64.rpm)
  3. Mpich (mpich-3.2.1)
  4. openBLAS (OpenBLAS-0.2.20)
  5. HPL provided by nvidia (hpl-2.0_FERMI_v15_latest)

When I run the code, it complains about a missing libmkl_intel_lp64.so, but when I do ldd on xhpl it shows no dependency on libmkl_intel_lp64.so. Moreover, I used OpenBLAS to compile HPL, so I am not sure why it is asking for an Intel MKL library!

Any suggestions?

What I run:
/root/hpl/mpich/bin/mpirun -np 1 -hostfile nodes ./run_linpack

Error:
libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

Dependency tree:
ldd xhpl
linux-vdso.so.1 => (0x00007ffeb597c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fd1e10f4000)
libdgemm.so.1 => /root/hpl/hpl-2.0_FERMI_v15_latest/src/cuda/libdgemm.so.1 (0x00007fd1e0eeb000)
libcublas.so.9.1 => /usr/local/cuda/lib64/libcublas.so.9.1 (0x00007fd1dd954000)
libcuda.so.1 => /usr/lib64/nvidia/libcuda.so.1 (0x00007fd1dcdb4000)
libcudart.so.9.1 => /usr/local/cuda/lib64/libcudart.so.9.1 (0x00007fd1dcb45000)
libopenblas.so.0 => /root/hpl/openblas/lib/libopenblas.so.0 (0x00007fd1dbbb6000)
libmpi.so.12 => /root/hpl/mpich/lib/libmpi.so.12 (0x00007fd1db737000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007fd1db510000)
libc.so.6 => /lib64/libc.so.6 (0x00007fd1db14d000)
/lib64/ld-linux-x86-64.so.2 (0x00005633fe7d7000)
librt.so.1 => /lib64/librt.so.1 (0x00007fd1daf45000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fd1dad40000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fd1daa38000)
libm.so.6 => /lib64/libm.so.6 (0x00007fd1da736000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd1da51f000)
libnvidia-fatbinaryloader.so.390.30 => /usr/lib64/nvidia/libnvidia-fatbinaryloader.so.390.30 (0x00007fd1da2d3000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007fd1d9fb1000)
libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007fd1d9d74000)

I can’t see libmkl_intel_lp64.so mentioned anywhere there, but every time I run HPL it still complains about the missing libmkl_intel_lp64.so.

Thanks,
Karan

My guess would be that your Make.CUDA file is calling out a link dependency on MKL.

For example, the libdgemm.so that gets built may depend on MKL.

Try lddtree instead of ldd.

Note that lddtree is not part of a standard Linux install; you’ll need to track down and install it yourself if you want to use it. Another similar tool is tldd.
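
For what it’s worth, on many distributions lddtree ships in the pax-utils package, so installing it is usually something like (assuming your package manager and repos carry it):

yum install pax-utils        # RHEL/CentOS (may need EPEL)
apt-get install pax-utils    # Debian/Ubuntu
lddtree ./xhpl               # then walk the full dependency tree of the binary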

However, if you don’t want to install something, you can just run ldd on the dependent libraries, such as libdgemm.so.
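
For example, using the libdgemm.so.1 path from your ldd output (adjust if yours lives elsewhere):

# see whether the prebuilt CUDA DGEMM wrapper itself wants MKL
ldd /root/hpl/hpl-2.0_FERMI_v15_latest/src/cuda/libdgemm.so.1 | grep -i mkl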

But it’s probably even simpler to just carefully inspect the Make.CUDA file.
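
e.g. a quick grep in the top-level HPL directory (the path is wherever you unpacked it):

# any mkl reference in the build configuration is a candidate culprit
grep -n -i mkl Make.CUDA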

Having said all that, I think you’re unlikely to get interesting or useful results using that very old version of HPL on any modern GPU (e.g., Maxwell, Pascal, or Volta); it was designed with Fermi in mind.

If you google around on these forums for other references to HPL, you’ll find more discussion of this; I’m not going to repeat it all here.