I’m trying to run HPL benchmarks on a couple of machines that have GTX 690 cards installed. When I run the benchmark it does not appear to touch the GPUs, and the results I get are identical to those from machines with no GPUs.
I suspect this may be because the GTX 690 is Kepler rather than Fermi. If that is the case, is there any way to run HPL on the GPUs I have?
Thank you!
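As a quick sanity check that CUDA can at least see the devices, I run the deviceQuery sample from the CUDA SDK (the path below is just where my build of the samples happens to live; adjust as needed):

# List the CUDA devices the runtime can see and their compute capability.
~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery | grep -E 'Device [0-9]+:|CUDA Capability'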
This is the hardware I’m running on:
Intel Core i7-3930K CPU
32 GB RAM
4 x GTX 690 cards
I’m running Ubuntu 12.04 LTS, 64-bit.
I installed these packages:
CUDA 4.2 Toolkit
CUDA 5.0 driver
openmpi1.5-bin, v1.5.4 (the same setup is installed on a couple of these machines)
Intel Compiler 13.0 Update 1
Intel MKL 11.0 Update 1
hpl-2.0_FERMI_v15
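To double-check what each box actually picked up, I verify the versions with the usual commands:

# Confirm toolkit, driver and MPI versions on the node.
nvcc --version                    # CUDA toolkit (should report 4.2)
cat /proc/driver/nvidia/version   # kernel driver (should report 304.xx)
mpirun --version                  # OpenMPI (should report 1.5.4)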
This is my run_linpack file:
#!/bin/bash
export HPL_DIR=/cluster/setup/hpl-2.0_FERMI_v15
export OMP_NUM_THREADS=2 #number of cpu cores per process
export MKL_NUM_THREADS=1 #number of cpu cores per GPU used
export MKL_DYNAMIC=TRUE
export CUDA_DGEMM_SPLIT=0.836 #how much work to offload to GPU for DGEMM
export CUDA_DTRSM_SPLIT=0.806 #how much work to offload to GPU for DTRSM
export LD_LIBRARY_PATH=$HPL_DIR/src/cuda:$LD_LIBRARY_PATH
$HPL_DIR/bin/CM01/xhpl
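I start it through mpirun with a single rank, since HPL.dat below uses a 1 x 1 process grid (just a sketch of my invocation; run_linpack’s location doesn’t matter much since it calls xhpl by absolute path):

# One MPI rank to match P = Q = 1 in HPL.dat.
mpirun -np 1 ./run_linpack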
My HPL.dat file:
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
41216 Ns
1 # of NBs
1024 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
1 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
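For sizing, N = 41216 keeps the matrix well inside the 32 GB of host RAM; the rough arithmetic I used:

# Matrix footprint for N = 41216 in double precision (8 bytes/element):
# 41216 * 41216 * 8 bytes ~= 12.7 GiB, well under the 32 GB installed.
echo '41216 * 41216 * 8 / 2^30' | bc -l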
This is my Make.CUDA file:
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = CUDA
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
# Set TOPdir to the location of where this is being built
ifndef TOPdir
TOPdir = /cluster/setup/hpl-2.0_FERMI_v15
endif
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /usr/lib/openmpi
MPinc = -I$(MPdir)/include
MPlib = /usr/lib/libmpi.a
MPlib = /usr/lib/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /opt/intel/composer_xe_2013.1.117/mkl/lib/intel64
LAinc = -I/usr/local/cuda/include
# CUDA
LAlib = -L/cluster/setup/hpl-2.0_FERMI_v15/src/cuda -ldgemm -L/opt/intel/composer_xe_2013.1.117/compiler/lib/intel64 -L/usr/local/cuda/lib64 -lcuda -lcudart -lcublas -L$(LAdir) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. **One and only one** option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_ : all lower case and a suffixed underscore (Suns,
# Intel, ...), [default]
# -DNoChange : all lower case (IBM RS6000),
# -DUpCase : all upper case (Cray),
# -DAdd__ : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle : The string address is passed at the string loca-
# tion on the stack, and the string length is then
# passed as an F77_INTEGER after all explicit
# stack arguments, [default]
# -DStringStructPtr : The address of a structure is passed by a
# Fortran 77 string, and the structure is of the
# form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal : A structure is passed by value for each Fortran
# 77 string, and the structure is of the form:
# struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle : Special option for Cray machines, which uses
# Cray fcd (fortran character descriptor) for
# interoperation.
#
F2CDEFS = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/usr/local/cuda/include
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS call the cblas interface;
# -DHPL_DETAILED_TIMING enable detailed timers;
# -DASYOUGO enable timing information as you go (nonintrusive)
# -DASYOUGO2 slightly intrusive timing information
# -DASYOUGO2_DISPLAY display detailed DGEMM information
# -DENDEARLY end the problem early
# -DFASTSWAP insert to use DLASWP instead of HPL code
#
# By default HPL will:
# *) not copy L before broadcast,
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
HPL_OPTS = -DCUDA
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
# next two lines for GNU Compilers:
CC = mpicc
CCFLAGS = $(HPL_DEFS) -O3 -fomit-frame-pointer -funroll-loops -W -Wall -fopenmp
# next two lines for Intel Compilers:
# CC = mpicc
#CCFLAGS = $(HPL_DEFS) -O3 -axS -w -fomit-frame-pointer -funroll-loops -openmp
#
CCNOOPT = $(HPL_DEFS) -O0 -w
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = $(CC)
#LINKFLAGS = $(CCFLAGS) -static_mpi
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------
MAKE = make TOPdir=$(TOPdir)
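For completeness, this is roughly how I rebuild the binary, using the standard HPL make flow with the arch name from Make.CUDA:

# Rebuild xhpl from the top-level HPL directory against Make.CUDA.
cd /cluster/setup/hpl-2.0_FERMI_v15
make arch=CUDA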
The GPUs do show up in nvidia-smi:
Fri Jan 18 17:21:49 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54 Driver Version: 304.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 690 | 0000:07:00.0 N/A | N/A |
| 30% 39C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 690 | 0000:08:00.0 N/A | N/A |
| 30% 36C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 690 | 0000:03:00.0 N/A | N/A |
| 30% 42C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 690 | 0000:04:00.0 N/A | N/A |
| 30% 42C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce GTX 690 | 0000:0B:00.0 N/A | N/A |
| 30% 37C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce GTX 690 | 0000:0C:00.0 N/A | N/A |
| 30% 37C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 690 | 0000:0F:00.0 N/A | N/A |
| 30% 30C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce GTX 690 | 0000:10:00.0 N/A | N/A |
| 30% 30C N/A N/A / N/A | 0% 7MB / 2047MB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
| 2 Not Supported |
| 3 Not Supported |
| 4 Not Supported |
| 5 Not Supported |
| 6 Not Supported |
| 7 Not Supported |
+-----------------------------------------------------------------------------+
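Since GPU-Util reads N/A on these GeForce boards, the only way I have to tell whether xhpl touches the GPUs is to watch the memory-usage column while the benchmark runs, with a crude polling loop like this:

# Poll GPU memory usage once a second while xhpl is running.
# If the CUDA offload were active I would expect the 7MB figures above to jump.
while true; do
    nvidia-smi | grep 'MB /'
    sleep 1
done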
This is the ldd output for the xhpl binary:
ldd xhpl
linux-vdso.so.1 => (0x00007fffb00e6000)
libdgemm.so.1 => /cluster/lib/libdgemm.so.1 (0x00007f00c1d0a000)
libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007f00c1aac000)
libcublas.so.4 => /usr/local/cuda/lib64/libcublas.so.4 (0x00007f00bb0ad000)
libmkl_intel_lp64.so => /usr/local/cuda/lib64/libmkl_intel_lp64.so (0x00007f00ba8c7000)
libmkl_intel_thread.so => /usr/local/cuda/lib64/libmkl_intel_thread.so (0x00007f00b9848000)
libmkl_core.so => /usr/local/cuda/lib64/libmkl_core.so (0x00007f00b87c8000)
libiomp5.so => /cluster/lib/libiomp5.so (0x00007f00b84c6000)
libmpl.so.1 => /usr/lib/libmpl.so.1 (0x00007f00b82c1000)
libcr.so.0 => /usr/lib/libcr.so.0 (0x00007f00b80b6000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f00b7e99000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f00b7ada000)
libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f00b6ef2000)
libmpich.so.3 => /usr/lib/libmpich.so.3 (0x00007f00b6b15000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f00b6911000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f00b6708000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f00b6408000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f00b610c000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f00b5ef5000)
/lib64/ld-linux-x86-64.so.2 (0x00007f00c1f27000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f00b5cde000)