CUDA accelerated Linpack not running, undefined symbol dtrsm

Hello everyone,

I’m struggling to run the CUDA accelerated Linpack benchmark on my university’s cluster. I got the benchmark from here: https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64

Building it seems to work; I do not get any errors there. But when I try to run the benchmark with

mpirun -np 4 ./run_linpack

I get the following output and error:

================================================================================
HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25000 
NB     :     768 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :   1ring 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : no-transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

/cluster/lib/openblas/lib/libopenblas.so.0: undefined symbol: dtrsm
/cluster/lib/openblas/lib/libopenblas.so.0: undefined symbol: dtrsm
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------

There seems to be some problem with the OpenBLAS library (undefined symbol: dtrsm). I was unable to fix it myself and couldn’t find any help online so far. I hope someone here has an idea what I should do or try next.

If you need more information to assist me, I will gladly provide it.
Kind regards
Lukas

You’ll probably need to find out what BLAS options exist on your university cluster, and try another one (i.e. link against another one, and make sure it is available at run-time). If you had root access to the machine(s) then I would suggest simply installing another one, but that probably won’t be possible if you’re an ordinary user.

Many clusters organize their available software packages using the module system, so that may be a place to start to find out what is available.

Certainly any university HPC organization should be able to explain to you how to find a CPU BLAS implementation that provides the dtrsm routine.
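For example (hypothetical module and library names — what your cluster actually calls them will differ), something along these lines shows what is available and whether a candidate library exports the routine:

```shell
# Hypothetical module names -- check what your cluster actually provides.
module avail 2>&1 | grep -iE 'blas|mkl|acml'   # list BLAS-related modules
module show openblas                           # see which paths the module sets
# A healthy OpenBLAS build should list "dtrsm_" among its dynamic symbols:
nm -D /cluster/lib/openblas/lib/libopenblas.so.0 | grep -i dtrsm
```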

You could also provide your own openblas in your own workspace, and link against that:

http://www.openblas.net/

As far as I know there is only openblas 0.2.19 on our machine. Should that contain the dtrsm routine? If so, is it possible that my linking causes the error? It was a little tricky to link against the openblas library and I’m not sure I did everything correctly.

I’m quite certain that the openblas module is/was loaded when I tried to run the benchmark.

I myself do not have root access, but I could ask the support team to install a different BLAS option if needed.

I believe that is the most recent openblas, and I believe it should provide the dtrsm routine.

Probably the next step is to review your compile/link sequence.

I added my Make.CUDA file at the end. The link sequence should be in there, right? The BLAS part starts at line 47. I am not really sure about LAdir (line 55) and LAlib (line 59). The path to the openblas directory seems to be the following:

/cluster/lib/openblas/

which includes the directories bin, include and lib.
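As a sanity check (paths taken from the post; `xhpl` is the usual HPL binary name — adjust if yours differs), one can confirm that the library exports the routine and which BLAS the binary actually resolves at run time:

```shell
# Does the library export dtrsm at all?  Note the Fortran trailing underscore.
nm -D /cluster/lib/openblas/lib/libopenblas.so.0 | grep -i dtrsm
# Which BLAS libraries does the built binary pick up at run time?
ldd ./xhpl | grep -iE 'blas|mkl'
```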

Make.CUDA

#  
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = CUDA
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
# Set TOPdir to the location of where this is being built
ifndef  TOPdir
TOPdir = .../hpl-2.0_FERMI_v15
endif
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a 
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
#MPdir        = /opt/intel/mpi/3.0
#MPinc        = -I$(MPdir)/include64
#MPlib        = $(MPdir)/lib64/libmpi.a
#MPlib        = $(MPdir)/lib64/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
#LAdir        = $(TOPdir)/../../lib/em64t
LAdir        = /cluster/lib/openblas/lib
LAinc        =
# CUDA
#LAlib        = -L /home/cuda/Fortran_Cuda_Blas  -ldgemm -L/cluster/cuda/9.0/lib -lcublas  -L$(LAdir) -lmkl -lguide -lpthread
LAlib        = -L/cluster/lib/openblas/lib -lopenblas -lblas -lopenblas_haswellp-r0.2.19 -L $(TOPdir)/src/cuda  -ldgemm -L/cluster/cuda/9.0/lib64 -L/cluster/cuda/9.0/lib64/stubs -lcuda -lcudart -lcublas -L$(LAdir) -lblas
# attempt 2017.11.27 102735 LAlib        = -L $(TOPdir)/src/cuda  -ldgemm -L/cluster/cuda/9.0/lib64 -L/cluster/cuda/9.0/lib64/stubs -lcuda -lcudart -lcublas -L$(LAdir) -lblas
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) -I/cluster/cuda/9.0/include
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_DETAILED_TIMING  enable detailed timers;
# -DASYOUGO              enable timing information as you go (nonintrusive)
# -DASYOUGO2             slightly intrusive timing information
# -DASYOUGO2_DISPLAY     display detailed DGEMM information
# -DENDEARLY             end the problem early  
# -DFASTSWAP             insert to use DLASWP instead of HPL code
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     =  -DCUDA
# ----------------------------------------------------------------------
#
HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
# next two lines for GNU Compilers:
CC      = mpicc
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp
# next two lines for Intel Compilers:
# CC      = mpicc
# CCFLAGS = $(HPL_DEFS) -O3 -axS -w -fomit-frame-pointer -funroll-loops -openmp 
#
CCNOOPT      = $(HPL_DEFS) -O0 -w
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = $(CC)
#LINKFLAGS    = $(CCFLAGS) -static_mpi
LINKFLAGS    = $(CCFLAGS) 
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------
MAKE = make TOPdir=$(TOPdir)

did you also modify some source code in src/cuda/cuda_dgemm.c ?

Yes, I changed something there. I removed the lines referring to mkl_intel. I thought MKL is just a BLAS library which I can’t use because it isn’t on my cluster. Here is what I changed in src/cuda/cuda_dgemm.c:

#ifdef GOTO
      handle = dlopen ("libopenblas.so", RTLD_LAZY);
#endif
#ifdef ACML
      handle = dlopen ("libacml_mp.so", RTLD_LAZY);
#else
      handle = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
#endif

and

#ifdef GOTO
      handle2 = dlopen ("libopenblas.so", RTLD_LAZY);
#endif
#ifdef ACML
      handle2 = dlopen ("libacml_mp.so", RTLD_LAZY);
#else
      handle2 = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
#endif

I removed lines 6 and 7 in each case. If I don’t remove those four lines, I get the following error (instead of the dtrsm error in my first post):

libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
libmkl_intel_lp64.so: cannot open shared object file: No such file or directory

OK, it makes sense now.

Actually, the #ifdef sequences look a little strange to me. So let’s do it differently. This is exactly what I did to get it working on CentOS 7 / CUDA 8:

  1. make the changes as you did to Make.CUDA and also run_linpack
  2. starting with an unmodified version, modify src/cuda/cuda_dgemm.c as follows:
#ifdef GOTO
      handle2 = dlopen ("libgoto2.so", RTLD_LAZY);
#endif
#ifdef ACML
      handle2 = dlopen ("libacml_mp.so", RTLD_LAZY);
#else
      handle2 = dlopen ("libopenblas.so", RTLD_LAZY);
      /* above line is changed from mkl library to openblas library */
#endif

make the indicated change above on the last of the 3 handle2 lines

#ifdef GOTO
      mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
#endif
#ifdef ACML
      mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
#else
      mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");
      /* on the line above, add the underscore after dtrsm */
#endif

make the indicated change above on the last of the 3 mkl_dtrsm lines

#ifdef GOTO
      handle = dlopen ("libgoto2.so", RTLD_LAZY);
#endif
#ifdef ACML
      handle = dlopen ("libacml_mp.so", RTLD_LAZY);
#else
      handle = dlopen ("libopenblas.so", RTLD_LAZY);
      /* above line is changed from mkl library to openblas library */
#endif

make the indicated change above on the last of the 3 handle lines

#ifdef GOTO
      dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
#endif
#ifdef ACML
      dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
#else
      dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");
      /* on the line above, add the underscore after dgemm */
#endif

make the indicated change above on the last of the 3 dgemm_mkl lines

  3. go back to the top directory, do a make clean, and make again.

This is not the only approach that will work, of course, but the previous instruction to just define GOTO will not work by itself, I don’t think. With these instructions you do not need to define GOTO anywhere; we are just taking the default MKL branch and replacing it with the openblas equivalents.

Oh wow, that worked. I’m so happy right now, thank you very, very much for your effort and help! I’m no programming expert (I’m studying civil engineering) and I’m really glad you helped me.

================================================================================
HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25000 
NB     :     768 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :    Left 
NBMIN  :       2 
NDIV   :       2 
RFACT  :    Left 
BCAST  :   1ring 
DEPTH  :       1 
SWAP   : Spread-roll (long)
L1     : no-transposed form
U      : no-transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR10L2L2       25000   768     2     2              30.34              3.433e+02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0022869 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

I didn’t tune anything yet and just used one node, thus the low Gflops…