High-performance Linpack benchmark

Hello to all,

I’ve been trying to compile the HPL benchmark on a Myrinet/MPICH Opteron cluster.
Our MPICH has been compiled with pgf90 and pgcc (by Microway). For the life of me
I can’t get past this point in the compilation:

…blah-blah…

pgf90 -fast -Mconcur -Minline=saxpy,sscal -Minfo -I/home/users/faculty/peterp/Test.d/new_hpl/hpl/include -I/home/users/faculty/peterp/Test.d/new_hpl/hpl/include/Linux_OPT_FBLAS -I/usr/rels/mpich/include -o /home/users/faculty/peterp/Test.d/new_hpl/hpl/bin/Linux_OPT_FBLAS/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /home/users/faculty/peterp/Test.d/new_hpl/hpl/lib/Linux_OPT_FBLAS/libhpl.a /usr/local/lib/libf77blas.a /usr/local/lib/libatlas.a /usr/rels/mpich/lib/libmpich.a
HPL_pddriver.o(.text+0x0): In function `main':
: multiple definition of `main'
/usr/pgi/linux86-64/5.2/lib/f90main.o(.text+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `main' changed from 94 in /usr/pgi/linux86-64/5.2/lib/f90main.o to 2226 in HPL_pddriver.o
/usr/pgi/linux86-64/5.2/lib/f90main.o(.text+0x3c): In function `main':
: undefined reference to `MAIN_'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x76f): In function `gmpi_regcache_init':
: undefined reference to `gm_hash_compare_ptrs'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x774): In function `gmpi_regcache_init':
: undefined reference to `gm_hash_hash_ptr'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x786): In function `gmpi_regcache_init':
: undefined reference to `gm_create_hash'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x79c): In function `gmpi_regcache_init':
: undefined reference to `gm_create_lookaside'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x7ec): In function `gmpi_regcache_deregister':
: undefined reference to `GM_PAGE_LEN'
/usr/rels/mpich/lib/libmpich.a(gmpi_regcache.o)(.text+0x7ff): In function `gmpi_regcache_deregister':
: undefined reference to `gm_deregister_memory'
...lots of errors of similar kind...
...etc. etc...
: undefined reference to `gm_destroy_lookaside'
make[2]: *** [dexe.grd] Error 2
make[2]: Leaving directory `/home/users/faculty/peterp/Test.d/new_hpl/hpl/testing/ptest/Linux_OPT_FBLAS'
make[1]: *** [build_tst] Error 2
make[1]: Leaving directory `/home/users/faculty/peterp/Test.d/new_hpl/hpl'
make: *** [build] Error 2

Any suggestions ?

Thanks,
Peter

P.S. Does PGroup offer an hpl.tar with the makefiles modified for their compiler suite?

Hi Peter,

The ‘multiple definition of main’ error occurs when you link C source code containing a ‘main’ function using a Fortran compiler. Fortran programs need to have ‘main’ added at link time, so the Fortran driver inserts one by default. Adding “-Mnomain” to the link line tells the compiler not to insert a ‘main’ function and will fix the problem. The other undefined reference errors occur because you’re missing the GM library on your link line. Adding “-L/path/to/gm/library -lgm” should fix it. You might need to add “-lpthread” as well.
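
To see which libraries a failed link is missing, it can help to pull the symbol names out of the linker output. The following is a minimal Python sketch of that idea, not a PGI or HPL tool: the function names `undefined_symbols` and `group_by_prefix` are made up for illustration, and grouping by the text before the first underscore is only a heuristic. The sample lines are taken from the log earlier in this thread.

```python
import re
from collections import OrderedDict

# Matches GNU ld's "undefined reference to `symbol'" messages. The quote
# character before the symbol is optional so that logs whose quoting was
# mangled in transit still parse.
_UNDEF = re.compile(r"undefined reference to\s+[`']?([A-Za-z_]\w*)")


def undefined_symbols(log_text):
    """Return the undefined symbol names in order of first appearance."""
    seen = OrderedDict()
    for match in _UNDEF.finditer(log_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


def group_by_prefix(symbols):
    """Group symbols by the lower-cased text before the first underscore."""
    groups = OrderedDict()
    for sym in symbols:
        prefix = sym.split("_", 1)[0].lower()
        groups.setdefault(prefix, []).append(sym)
    return dict(groups)


# A few lines copied from the failed link (one repeated on purpose).
SAMPLE_LOG = """\
: undefined reference to `MAIN_'
: undefined reference to `gm_hash_compare_ptrs'
: undefined reference to `gm_create_hash'
: undefined reference to `GM_PAGE_LEN'
: undefined reference to `gm_create_hash'
"""

if __name__ == "__main__":
    grouped = group_by_prefix(undefined_symbols(SAMPLE_LOG))
    for prefix, names in grouped.items():
        print(prefix, len(names), "symbol(s):", ", ".join(names))
```

In this log the gm names point at the Myrinet GM library (-lgm), while MAIN_ is the one symbol referenced from f90main.o, the object that supplies the Fortran driver's own ‘main’.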

We don’t have a preconfigured makefile, but I’ll investigate if we can add a PGI HPL guide to our support pages. Although I haven’t worked with HPL enough to know what the optimal flag set is, ‘-fastsse’ generally gives the best performance. Also I’d remove “-Mconcur” since it is used to auto-parallelize on SMP systems.

Hope this helps,
Mat

I love you guys…you’ll get an acknowledgement on the first paper to
come out of using this cluster…

Everything compiled fine (lots of loops unrolled…) and a small test case
ran fine.

As a final bother, when you get the chance can you look over the make
I include below and please let me know if I need to change anything ?
I have no idea what should go in $CCNOOPT so I improvised…

Best regards,
Peter
Here’s the Make.arch:

# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------

SHELL = /bin/sh

CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch

# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------

ARCH = Linux_OPT_FBLAS

# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------

TOPdir = $(HOME)/Test.d/new_hpl/hpl
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)

HPLlib = $(LIBdir)/libhpl.a

# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.

MPdir = /usr/rels/mpich
MPinc = -I$(MPdir)/include
MPlib = $(MPdir)/lib/libmpich.a

# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.

LAdir = /usr/local/lib
LAinc =
LAlib = $(LAdir)/libf77blas.a $(LAdir)/libatlas.a

# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. One and only one option should be chosen in each of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore (Suns,
#                       Intel, …), [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string
#                       location on the stack, and the string length is
#                       then passed as an F77_INTEGER after all explicit
#                       stack arguments, [default]
# -DStringStructPtr   : The address of a structure is passed by a
#                       Fortran 77 string, and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each Fortran
#                       77 string, and the structure is of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for Cray machines, which uses
#                       Cray fcd (fortran character descriptor) for
#                       interoperation.
#

F2CDEFS =

# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------

HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)

# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.

HPL_OPTS = -fastsse -Minline=saxpy,sscal -Minfo -lpthread

# ----------------------------------------------------------------------

HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)

# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------

CC = pgcc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS)

LINKER = pgf90 -Mnomain
LINKFLAGS = $(CCFLAGS) -L/opt/gm/lib64 -lgm

ARCHIVER = ar
ARFLAGS = r
RANLIB = echo

# ----------------------------------------------------------------------

Thanks Peter, we appreciate the compliment.

Whenever you see a ‘NOOPT’ type make variable, it’s usually because the authors are either working around a bug and need to compile a particular file without optimization, or don’t want the compiler to optimize a file. In this case, ‘CCNOOPT’ is used to compile the file src/auxil/HPL_dlamch.c, which determines some machine-specific arithmetic constants. Compiler optimizations can reorder operations so that the code is no longer strictly compliant with the IEEE 754 floating-point arithmetic standard. The authors most likely want this file to be strictly compliant, so I’d set ‘CCNOOPT=-O0 -Kieee’. It should not affect the overall performance but should give you more accurate results.
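
To make the concern concrete, here is a minimal Python sketch of the kind of computation a routine like HPL_dlamch performs: finding the machine epsilon by repeated halving. It is an illustration only, not HPL’s code, and the function name `machine_epsilon` is hypothetical. The loop gives the right answer only if every `1.0 + eps / 2.0` is rounded to double precision exactly as written, which is the property that unoptimized, IEEE-conforming compilation of the C file is meant to protect.

```python
import sys


def machine_epsilon():
    """Find the gap between 1.0 and the next larger double by halving.

    Stops as soon as adding half of the current candidate to 1.0 no
    longer produces a value greater than 1.0.
    """
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps


if __name__ == "__main__":
    eps = machine_epsilon()
    print("computed epsilon:", eps)
    print("matches sys.float_info.epsilon:", eps == sys.float_info.epsilon)
```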

  • Mat

Thanks Mat, I’ll try the suggestion…apart from that, can you suggest any other
flags that may be beneficial ?

Best regards,
Peter

Hi Peter,

Since I haven’t specifically looked at HPL, I don’t know which flags give the best performance. For 64-bit systems in general, the aggregate flag ‘-fastsse’ gives very good performance. ‘-fastsse’ is roughly equivalent to ‘-O2 -Munroll=c:2 -Mlre -Mnoframe -Mscalarsse -Mvect=sse -Mcache_align -Mflushz’. If you have the time to experiment, some other options to try are:

-fastsse -Minline=levels:2 ← adjust the levels as needed
-fastsse -Mipa=fast,inline,safe
-fastsse -Munroll=n:4 ← adjust the number of times to unroll as needed
-fastsse -O3
-fastsse -Mipa=fast,libinline,libopt ← may not work since the libraries were not compiled with IPA
-fastsse -Mipa=fast,safe

Also try mixing and matching the options. For example, if both ‘-O3’ and IPA inlining help, try ‘-fastsse -O3 -Mipa=fast,inline’.

You can also experiment with -Mprefetch, but prefetching generally only helps memory-bound codes, which I don’t believe this code is.
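
One way to organize the mixing and matching is to generate the candidate flag sets up front and then rebuild and rerun HPL once per set. This is a minimal Python sketch, assuming a few of the options listed above as the pool. `candidate_flag_sets` and `max_extras` are hypothetical names, and the script only builds strings; it does not invoke the compiler.

```python
from itertools import combinations

BASE = "-fastsse"

# Extra options taken from the suggestions above; tune the values as needed.
EXTRAS = [
    "-O3",
    "-Minline=levels:2",
    "-Munroll=n:4",
    "-Mipa=fast,inline",
]


def candidate_flag_sets(base=BASE, extras=EXTRAS, max_extras=2):
    """Return the base flag alone, then the base plus up to max_extras extras."""
    flag_sets = [base]
    for count in range(1, max_extras + 1):
        for combo in combinations(extras, count):
            flag_sets.append(" ".join((base,) + combo))
    return flag_sets


if __name__ == "__main__":
    # Each printed line is one candidate value for CCFLAGS.
    for flags in candidate_flag_sets():
        print(flags)
```

Keeping `max_extras` small keeps the number of rebuilds manageable; options that help on their own can then be combined further by hand.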

Let us know what you find out,
Mat

Hi there,

Here’s a typical result on the linpack benchmark with
HPL_OPTS = -fastsse -Minline=saxpy,sscal -Minfo -lpthread
and
CCNOOPT = -O0 -Kieee
on 81 Opteron 2.4 GHz processors:

T/V                N    NB     P     Q               Time             Gflops

WR01L3R2       30000    80     9     9              85.68          2.101e+02

||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0094769 … PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0151869 … PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0030144 … PASSED
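
As a sanity check on the reported figure, the rate can be recomputed from N and the wall time. The sketch below assumes HPL’s usual operation count of 2/3·N³ + 3/2·N² floating-point operations for the solve; the function name `hpl_gflops` is hypothetical.

```python
def hpl_gflops(n, seconds):
    """Gflops for an order-n solve, assuming 2/3*n^3 + 3/2*n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1.0e9


# Values from the run above: N = 30000, Time = 85.68 s, P x Q = 9 x 9.
N = 30000
TIME_S = 85.68
PROCS = 9 * 9

if __name__ == "__main__":
    total = hpl_gflops(N, TIME_S)
    print(f"total:       {total:.3e} Gflops")
    print(f"per process: {total / PROCS:.2f} Gflops")
```

This reproduces the 2.101e+02 Gflops in the table, or roughly 2.6 Gflops per process across the 81 processes.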

Thanks for the help.

Best regards,
Peter