HPL on cuBlas : Ok, but not on Tesla 1060 Board ! Tesla board crash on large array when launchin

Hello,

I’m working since last week-end on cuBlas porting applications, in order to estimate the difficulties to match a CBLAS ou FBLAS code to CuBLAS.

As my first exercice, I try to adapt the Linpack clone HPL (http://www.netlib.org/benchmark/hpl/), which launches normally under Atlas libraries.

After hours of hard work (my first one…), I succeed in launching xhpl on Cuda supported nVidia board (NVS 160 and 8600 GT).

First remark : I reach more that 200 Gflops (as HPL estimate it) under these two boards

Second remark, when I port the code to a host which holds Tesla 1060, my program crashes, not for small arrays, but for big ones !

For 256 order:

====================
T/V N NB P Q Time Gflops

WR00L2L2 256 256 1 1 0.20 5.708e-02
WR00L2L4 256 256 1 1 0.06 1.809e-01
WR00L2C2 256 256 1 1 0.06 1.833e-01
WR00L2C4 256 256 1 1 0.06 1.834e-01
WR00L2R2 256 256 1 1 0.06 1.863e-01
WR00L2R4 256 256 1 1 0.06 1.879e-01
WR00C2L2 256 256 1 1 0.06 1.856e-01
WR00C2L4 256 256 1 1 0.06 1.875e-01
WR00C2C2 256 256 1 1 0.06 1.893e-01

For 512 order:

[poseidon:17221] *** Process received signal ***
[poseidon:17221] Signal: Floating point exception (8)
[poseidon:17221] Signal code: Integer divide-by-zero (1)
[poseidon:17221] Failing at address: 0x7fdd3aec4b9b
[poseidon:17221] [ 0] /lib/libpthread.so.0 [0x7fdd3957ba80]
[poseidon:17221] [ 1] /usr/local/cuda/lib64/libcublas.so.3 [0x7fdd3aec4b9b]
[poseidon:17221] [ 2] /usr/local/cuda/lib64/libcublas.so.3(cublasDcopy+0x36c) [0x7fdd3af48f8c]

The more funny is that my 2 laptops with shared memory and 4 GB of RAM haven’t any problems and that a Tesla board with 4 GB on a 8Gb host crash…

Not very funny in fact… Any idea ?

So this is a single precision port of HPL using sgemm()? I would be really, really surprised if those two cards could hit single precision 200GFlop/s combined.

I have run cublas dgemm() with m=n=k=12000 (that basically fills 4Gb) without a problem in the past.

To your first mark…

The following parameter values will be used:

N : 5120

NB : 1024

PMAP : Row-major process mapping

P : 1

Q : 2

PFACT : Left Crout Right

NBMIN : 2 4

NDIV : 2

RFACT : Left Crout Right

BCAST : 1ring

DEPTH : 0

SWAP : Spread-roll (long)

L1 : transposed form

U : transposed form

EQUIL : yes

ALIGN : 16 double precision words

============================================================

====================

T/V N NB P Q Time Gflops


WR00L2L2 5120 1024 1 2 7.19 1.245e+01

WR00L2L4 5120 1024 1 2 0.27 3.300e+02

WR00L2C2 5120 1024 1 2 0.28 3.148e+02

WR00L2C4 5120 1024 1 2 0.27 3.295e+02

WR00L2R2 5120 1024 1 2 0.29 3.040e+02

WR00L2R4 5120 1024 1 2 0.27 3.358e+02

WR00C2L2 5120 1024 1 2 0.29 3.066e+02

WR00C2L4 5120 1024 1 2 0.28 3.218e+02

WR00C2C2 5120 1024 1 2 0.29 3.134e+02

WR00C2C4 5120 1024 1 2 0.28 3.199e+02

WR00C2R2 5120 1024 1 2 0.29 3.107e+02

WR00C2R4 5120 1024 1 2 0.27 3.291e+02

WR00R2L2 5120 1024 1 2 0.29 3.069e+02

WR00R2L4 5120 1024 1 2 0.29 3.127e+02

WR00R2C2 5120 1024 1 2 0.29 3.120e+02

WR00R2C4 5120 1024 1 2 0.28 3.240e+02

WR00R2R2 5120 1024 1 2 0.28 3.141e+02

WR00R2R4 5120 1024 1 2 0.28 3.204e+02

============================================================

====================

Finished 18 tests with the following results:

         18 tests completed without checking,

          0 tests skipped because of illegal input values.

End of Tests.

============================================================

====================

To the second one, I rebind all the function in order to match old CBLAS functions to CuBLAS ones.

I will try to do it directly from a simple example.

Cheers

But how are you doing double precision arithmetic on devices that don’t support it? Neither of your laptop cards support double precision.

The 8600GT (32 cores) only has about 120GFlops peak single precision and the NVS160 (8 cores) about 30Gflops. A full desktop 9800GTX with 1GB of memory can only reach about 200Gflops SGEMM on 4096x4096 square matrices, and you are claiming 320Gflops double precision from a a pair of mobile cards with a sum total performance of only about 150GFlops single precision and for cases where the maximum GEMM call size is only m=5120, n= 5120, k=1024.

There is something wrong with this picture.

EDIT: My guess is that your laptop version isn’t actually running anything at all (if you try calling double precision cublas functions on non capable hardware, they just return immediately without doing anything), and because you don’t have checking enabled, you don’t see that the factorization failed.

Just out of interests sake, this is what the results for a single GTX200 class should look like (this is on a PCI-e 1.0 host):

$ bin/CUDA/xhpl 

============================================================

====================

HPLinpack 2.0  --  High-Performance Linpack benchmark  --   September 10, 2008

Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK

Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK

Modified by Julien Langou, University of Colorado Denver

============================================================

====================

An explanation of the input/output parameters follows:

T/V	: Wall time / encoded variant.

N	  : The order of the coefficient matrix A.

NB	 : The partitioning blocking factor.

P	  : The number of process rows.

Q	  : The number of process columns.

Time   : Time in seconds to solve the linear system.

Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N	  :   14400 

NB	 :	 768 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :	Left	Crout	Right 

NBMIN  :	   4 

NDIV   :	   2 

RFACT  :	Left	Crout	Right 

BCAST  :   1ring 

DEPTH  :	   0 

SWAP   : Mix (threshold = 64)

L1	 : no-transposed form

U	  : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2L4	   14400   768	 1	 1			  43.57			  4.569e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0031688 ...... PASSED

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2C4	   14400   768	 1	 1			  43.37			  4.591e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0025378 ...... PASSED

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2R4	   14400   768	 1	 1			  43.28			  4.600e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=		0.0029530 ...... PASSED

That’s exacty why, when I got the results, I thought that it was a bit “strange”.

This port was a “test” : it was my first travel to Cuda world and I represent scientist who looks like to use GPU for his codes.

I use a library, CBLAS, I want to use a new library, CUBLAS. I compare the prototypes. Both seem to get the same name, the same paramters to be called.

First, I discover that the parameters of CUBLAS were not the same as CBLAS for the types (ORDER, SIDE, etc).

Second, I adapt the code to match it.

Third, I run the codes to several architectures and, (you didn’t notice this perhaps) I’m a bit surprised by the results. It brokes on great card but not laptop cards.

Fourth, I contact the forum to get an explanation.

What did you answer ?

  • the laptop cards does not support double precision.
    R: I call cuBLAS library instead on CBLAS with the same functions (and the same precision, Double). If my board does not support Double, It may break or something like that !
  • the cuBLAS library returns without doing anything,
    R: ok… That’s interesting. So I had to use error codes to know what state return

Ok. you’re right… The change on threshold parameter indicates that there are problem in calls.

So to be constructive, I had a statement for CUBLAS as CBLAS and FBLAS ones in each src/blas/HPL_*.c files.

For example, for HPL_dgemm.c, it’s the following

#ifdef HPL_CALL_CUBLAS

char transa,transb;

if (TRANSA==HplNoTrans)
transa=CudaNoTrans;
else if (TRANSA==HplTrans)
transa=CudaTrans;
else
transa=CudaConjTrans;

if (TRANSB==HplNoTrans)
transb=CudaNoTrans;
else if (TRANSB==HplTrans)
transb=CudaTrans;
else
transb=CudaConjTrans;

if( ORDER == HplColumnMajor )
{
cublasDgemm( transa, transb, M, N, K, ALPHA,A, LDA, B, LDB, BETA, C, LDC);
}
else
{
cublasDgemm( transb, transa, N, M, K, ALPHA, B, LDB, A, LDA, BETA, C, LDC );
}
#endif

Where am I wrong in my approach ? I have to change all the calls by ones like in fortran.c example ?

Thanks !

Unless you are building with the “thunking” intterface turned on (see the cublas documentation for details and also why it isn’t a great idea to do so), you need to add memory management for the GPU - all three matrices need memory allocated for them on the GPU, and then copied to the GPU before the call, and then the C matrix needs to be copied back afterwards. That imposes a lot of overhead, so you certainly don’t want to do it for every call because it won’t be faster than the host processor under all conditions. In HPL, the dgemmNN calls are the ones with the largest matrices which can benefit the most from the GPUs performance, so that is the obvious candidate, and you might want to put a size limit in to determine when to call the CPU and when to call the GPU blas.

There are even better performing alternatives, like this one, but the implementation is more complex.

Ok, I’ve done it with thunking approach (first step for users !). I’m quite sure that the CuBlas commands are well called in fortran.c program.

But, on the tests, it fails :

N	  :   14400 

NB	 :	 768 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :	Left	Crout	Right 

NBMIN  :	   4 

NDIV   :	   2 

RFACT  :	Left	Crout	Right 

BCAST  :   1ring 

DEPTH  :	   0 

SWAP   : Mix (threshold = 64)

L1	 : no-transposed form

U	  : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2L4	   14400   768	 1	 1			  86.60			  2.299e+01

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=			  nan ...... FAILED

For smaller dimensions, the output are rather different:

N	  :	1890 

NB	 :	 768 

PMAP   : Column-major process mapping

P	  :	   1 

Q	  :	   1 

PFACT  :	Left	Crout	Right 

NBMIN  :	   4 

NDIV   :	   2 

RFACT  :	Left	Crout	Right 

BCAST  :   1ring 

DEPTH  :	   0 

SWAP   : Mix (threshold = 64)

L1	 : no-transposed form

U	  : no-transposed form

EQUIL  : yes

ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

	  ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be			   1.110223e-16

- Computational tests pass if scaled residuals are less than				16.0

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2L4		1890   768	 1	 1			   3.63			  1.240e+00

HPL ERROR from process # 0, on line 321 of function HPL_pdtest:

>>> Error code returned by solve is 1875, skip <<<

Any idea about the problems I should look at ?

Cheers

cublas is a fortran ordered blas, so you might need to check whether you are telling HPL is expect row or column ordered storage in your build and runtime flags.

#ifdef HPL_CALL_CUBLAS

char transa,transb;

if (TRANSA==HplNoTrans)

	transa=CudaNoTrans;

  else if (TRANSA==HplTrans)

	transa=CudaTrans;

  else

	transa=CudaConjTrans;

if (TRANSB==HplNoTrans)

	transb=CudaNoTrans;

  else if (TRANSB==HplTrans)

	transb=CudaTrans;

  else

	transb=CudaConjTrans;

if( ORDER == HplColumnMajor )

	{

	  CUBLAS_DGEMM( &transa, &transb, &M, &N, &K, &ALPHA, 

			A, &LDA, B, &LDB, &BETA, C, &LDC);  

	}

  else

	{

	 CUBLAS_DGEMM( &transb, &transa, &N, &M, &K, &ALPHA, 

		   B, &LDB, A, &LDA, &BETA, C, &LDC );

	}

Like this, I suppose… Done since 3 days, for all HPL calls ! Anything else ?

A strange thing during the tests : when Ns=NBs, no problem for FBLAS, but, on CuBLAS :

** On entry to DTRSV  parameter number 4 had an illegal value

============================================================

====================

T/V				N	NB	 P	 Q			   Time				 Gflops

--------------------------------------------------------------------------------

WC00L2L4		2560  2560	 1	 1			   8.23			  1.360e+00

--------------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=			  nan ...... FAILED

This message “** On entry to DTRSV parameter number 4” came from cublas.so library, at the end of

/home/buildmeister/build/rel/gpgpu/toolkit/r3.0/cublas/src/cgemm.cu

The 4th parameter seems to be ok… Any idea ?

Cheers

That DGEMM logic looks OK, it should work, that isn’t all that different to the way we have modified HPL_dgemm to use our own DGEMM routines.

As for the DTRSV message, I am really not sure that the factorization algorithm used by HPL will work for NB=N (I certainly have never tried it). I also have not used the thunking interface, so I don’t know whether it is doing anything funny like data transposition on its own.