CPU+GPU dgemm

hi…

I am trying a simple double-precision matrix-matrix multiplication using a combined CPU+GPU approach. For the CPU I am using cblas_dgemm(…) linked with Intel MKL 10.0.3, and on the GPU I am using CUBLAS with CUDA driver 3.2. The machine is an Intel Xeon E5420 (dual socket, quad core) at 2.5 GHz with a peak of 80 GFLOPS, paired with a Tesla C1060 in a PCIe x16 Gen2 slot. Here are some of the results I am getting:

  1. GPU only --> CUBLAS --> each matrix of size 12288 x 12288 --> 71.1 GFLOPS sustained (double precision)
  2. CPU+GPU dgemm --> CUBLAS + CBLAS --> each matrix of size 12288 x 12288 --> 142.8 GFLOPS sustained (double precision, dividing matrix B equally between the CPU and the GPU)

I am taking the total double-precision peak for CPU+GPU to be 80 + 78 = 158 GFLOPS.
The sustained rate I am getting is 142.8 GFLOPS.

My questions are:
--> Are the results acceptable?
--> How do I decide whether the sustained performance I obtained is correct?

Those results seem reasonably close to my attempts to do the same: 85-90% overall computational efficiency is about correct for a problem of the size of yours.
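
The efficiency figure can be checked with a few lines of arithmetic. The sketch below uses Python purely for convenience. The function names are made up for illustration, and the 4 double-precision flops per cycle per core is an assumption chosen to match the 80 GFLOPS peak quoted above (8 cores x 2.5 GHz x 4). A dgemm on N x N matrices is conventionally counted as 2*N^3 floating-point operations.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops per cycle per core."""
    return cores * ghz * flops_per_cycle

def dgemm_gflops(n, seconds):
    """Sustained rate of an n x n x n dgemm, counted as 2*n^3 flops."""
    return 2.0 * n ** 3 / seconds / 1e9

def efficiency(sustained, peak):
    """Fraction of theoretical peak actually achieved."""
    return sustained / peak

# Assumption: 4 double-precision flops per cycle per core on the Xeon E5420.
cpu_peak = peak_gflops(cores=8, ghz=2.5, flops_per_cycle=4)
gpu_peak = 78.0  # Tesla C1060 double-precision peak as quoted in the thread

combined = efficiency(142.8, cpu_peak + gpu_peak)
gpu_only = efficiency(71.1, gpu_peak)
print(f"combined: {combined:.1%}, GPU only: {gpu_only:.1%}")
```

Both ratios come out at roughly 90%, which is where the 85-90% remark comes from. To cross-check a reported rate, time the call and feed the wall-clock seconds to dgemm_gflops.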

Hi…

I am trying a simple double-precision matrix-matrix multiplication using a CPU+GPU cluster.
For the CPU I am using cblas_dgemm(…) linked with LAPACK, and on the GPU I am using CUBLAS with CUDA driver 3.2.
I am using an Intel Core2Duo E5400 @ 2.7 GHz with a GTX 460 in a PCIe x16 slot.
Here are some of the results I am getting.
1. CPU only --> each matrix of size 28672 x 28672 --> 33 GFLOPS sustained (double precision)
2. GPU only --> each matrix of size 28672 x 28672 --> 59 GFLOPS sustained (double precision)
3. CPU + GPU DGEMM !
I want to know how to do a CPU + GPU DGEMM.
Thank you.

How are you calculating a (28672 x 28672) * (28672 x 28672) dense matrix multiply on a GTX 460? That is 21 GB worth of matrices in double precision, on a device with a maximum of 1 GB of memory. If you really are doing the computation at that size on the GPU, then you already know how to split a single matrix multiply into multiple gemm calls, and extending what you have to use both CPU and GPU should be trivial.

Also, my experience with Core2 Duo E6750 processors (3.0 GHz, 4 MB L2 cache, 1333 MHz FSB, giving 24 double-precision GFLOP/s peak, i.e. considerably faster than your E5400) is that they give a maximum of about 21 GFLOP/s using multithreaded GotoBLAS or the Intel MKL dgemm. Both of those dgemm kernels are much, much faster than the standard C BLAS. Then there is the memory problem again. Most socket 775 chipsets can only address 8 GB of memory. So how are you fitting 21 GB of double-precision matrices into a machine with less than 8 GB of memory? A 33 GFLOP/s dgemm on the CPU hardware you have, with the code you are running, and at the matrix size you are claiming, seems impossible.
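
Both objections are quick to reproduce. The sketch below (Python, with hypothetical helper names) counts the bytes needed to hold A, B and C in double precision and compares a claimed rate against a theoretical peak. The 4 flops per cycle per core is an assumption, the same one that gives 24 GFLOP/s for the 3.0 GHz dual-core part mentioned above.

```python
BYTES_PER_DOUBLE = 8

def dgemm_footprint_bytes(m, n, k):
    """Bytes to hold A (m x k), B (k x n) and C (m x n) in double precision."""
    return BYTES_PER_DOUBLE * (m * k + k * n + m * n)

def fits(required_bytes, capacity_bytes):
    """True if the working set fits in the given memory capacity."""
    return required_bytes <= capacity_bytes

def peak_gflops(cores, ghz, flops_per_cycle=4):
    # Assumption: 4 double-precision flops per cycle per core.
    return cores * ghz * flops_per_cycle

GIB = 2 ** 30
n = 28672
total = dgemm_footprint_bytes(n, n, n)

print(f"A, B and C together: {total / 1e9:.1f} GB ({total / GIB:.3f} GiB)")
print("fits in 1 GiB of device memory:", fits(total, 1 * GIB))
print("fits in 8 GiB of host memory:", fits(total, 8 * GIB))

e5400_peak = peak_gflops(cores=2, ghz=2.7)
print(f"E5400 peak {e5400_peak:.1f} GFLOP/s, claimed 33:", 33.0 <= e5400_peak)
```

Under these assumptions the three matrices need roughly 20 GB, in the neighbourhood of the figure quoted above, and the claimed 33 GFLOP/s is above what a single dual-core chip at 2.7 GHz could reach even in theory.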

So I doubt the veracity of both your CPU and GPU results on a number of counts.

But to answer your question, if you want to see how to do what you are interested in, I recommend this paper.

I have been trying to follow the paper you linked.
It seems that the result comes out the same.
I do not know what I am doing wrong.
Files: HPL_dgemm.c and HPL_dgemm-gpu-cpu-N1N2.c

Call the host BLAS on the residual part of the LHS matrix after launching the GPU BLAS, then join the two result matrices back together again afterwards. CUBLAS calls are asynchronous; the host will wait at the next memcpy if the GPU isn’t finished yet.
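
The partitioning logic can be sketched without any GPU at all. In the Python sketch below, a naive matmul stands in for both cblas_dgemm and the CUBLAS dgemm, and a worker thread stands in for the asynchronous GPU launch. All function names are hypothetical. Splitting the rows of A in proportion to measured throughput is one reasonable choice, not the only one.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul(a, b):
    """Naive row-major product; a stand-in for cblas_dgemm or the CUBLAS dgemm."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def split_rows(m, gpu_gflops, cpu_gflops):
    """Number of rows of A given to the GPU, proportional to its throughput."""
    return round(m * gpu_gflops / (gpu_gflops + cpu_gflops))

def hybrid_matmul(a, b, gpu_gflops, cpu_gflops):
    m_gpu = split_rows(len(a), gpu_gflops, cpu_gflops)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # "Launch" the GPU share first; the call returns immediately.
        gpu_part = pool.submit(matmul, a[:m_gpu], b)
        # The host works on the residual rows in the meantime.
        cpu_part = matmul(a[m_gpu:], b)
        # Wait for the "GPU", then join the two row blocks of C.
        return gpu_part.result() + cpu_part

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
b = [[1, 0], [0, 1], [2, 3]]
c = hybrid_matmul(a, b, gpu_gflops=3.0, cpu_gflops=1.0)
print(c == matmul(a, b))
```

In a real implementation the slices would not be copies: with column-major BLAS storage a row block of A or C is addressed through a pointer offset and the leading dimension, and the GPU share has to be copied to and from device memory.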

You might want to rethink memory management: allocating and freeing memory at each dgemm call will be terribly slow.

I would greatly appreciate it if you could explain the results you posted earlier: how they were obtained, and what matrix sizes and hardware you are using for these experiments…

Result CPU only

================================================================================
HPLinpack 2.0 – High-Performance Linpack benchmark – September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 20480 25600 26624 27648 28672 29696
NB : 256 512 768 1024
PMAP : Column-major process mapping
P : 2 4
Q : 4 2
PFACT : Left
NBMIN : 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V            N     NB    P    Q      Time      Gflops

WC00R2L4   28672    256    4    2    470.88   3.337e+01
WC00R2L4   29696    256    4    2    518.71   3.366e+01

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0046130 … PASSED

Finished 192 tests with the following results:
192 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.

End of Tests.

Result GPU

HPLinpack 2.0 – High-Performance Linpack benchmark – September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 20480 24064 28672
NB : 256 512 768
PMAP : Column-major process mapping
P : 4 2
Q : 2 4
PFACT : Left Crout Right
NBMIN : 4
NDIV : 2
RFACT : Left Crout Right
BCAST : 1ring
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : no-transposed form
U : no-transposed form
EQUIL : yes
ALIGN : 8 double precision words


  • The matrix A is randomly generated for each test.
  • The following scaled residual check will be computed:
    ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
  • The relative machine precision (eps) is taken to be 1.110223e-16
  • Computational tests pass if scaled residuals are less than 16.0

================================================================================
T/V            N     NB    P    Q      Time      Gflops

WC00L2C4   28672    768    2    4    269.38   5.834e+01
WC00L2R4   28672    768    2    4    263.82   5.957e+01
WC00C2L4   28672    768    2    4    265.25   5.925e+01
WC00C2C4   28672    768    2    4    266.42   5.899e+01

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0037551 … PASSED

Finished 144 tests with the following results:
144 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.

End of Tests.

So those results were HPL Linpack results (ie. a dense LU factorization), not BLAS dgemm results at all? And the matrix sizes are the size of the factorization not the matrix passed to dgemm? And you used 8 MPI processes (so I guess 4 dual core machines) to get them? In other words they had absolutely nothing to do with what this thread is actually about?
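
The numbers in the logs are consistent with that reading. HPL reports its rate from the operation count of an LU solve, generally given as 2/3*N^3 + 3/2*N^2, not from the 2*N^3 of a dgemm, and the N x N matrix is distributed over the P x Q process grid. A short Python sketch (helper names are hypothetical) reproduces the reported figures from N and the wall time:

```python
def hpl_gflops(n, seconds):
    """Rate of an order-n HPL solve, counted as 2/3*n^3 + 3/2*n^2 flops."""
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1e9

def hpl_bytes_per_process(n, p, q):
    """Approximate share of the n x n double-precision matrix per process."""
    return 8 * n * n / (p * q)

n = 28672
cpu_rate = hpl_gflops(n, 470.88)   # CPU-only log, WC00R2L4
gpu_rate = hpl_gflops(n, 263.82)   # GPU log, WC00L2R4
per_proc = hpl_bytes_per_process(n, p=4, q=2)

print(f"CPU run: {cpu_rate:.2f} Gflops, GPU run: {gpu_rate:.2f} Gflops")
print(f"matrix share per process: {per_proc / 1e9:.2f} GB")
```

The rates match the Gflops column of the two logs. The per-process share is well under 1 GB, so an order of 28672 fits comfortably across 8 processes even though a dgemm on three matrices of that size would not fit on one machine.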

We are doing double-precision matrix-matrix multiplication on a Tesla C2050 using CUBLAS 3.2,
and the same computation on the CPU using CBLAS (Intel Xeon X5450, dual-socket quad-core system),
in order to compare the GPU results against the CPU. However, the GPU results diverge
from the CPU results.

The results agree only to about ten decimal places; beyond that they deviate.

Could you please help us decide whether this divergence is acceptable?

We are using the drand48() function to generate the input matrices A and B.
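
One common reason for differences at this level is that floating-point addition is not associative, so two dgemm implementations that accumulate the same products in a different order (or, possibly, with fused multiply-add on the GPU) will usually not agree bit for bit. The Python sketch below illustrates the effect on a single dot product, which is what each entry of C is. It is an illustration under stated assumptions, not a description of what CUBLAS or the CPU BLAS does internally. The rough bound k*eps uses eps = 2^-53, the same value the HPL output above prints as the relative machine precision.

```python
import math
import random

def dot_forward(x, y):
    """Accumulate left to right, as a simple serial loop would."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc

def dot_pairwise(x, y):
    """Same products, summed as a binary tree, as a parallel reduction might."""
    terms = [a * b for a, b in zip(x, y)]
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])
        terms = paired
    return terms[0]

def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b))

random.seed(0)
k = 4096                                    # inner dimension of the product
x = [random.random() for _ in range(k)]     # uniform in [0, 1), like drand48()
y = [random.random() for _ in range(k)]

eps = 2.0 ** -53
bound = k * eps                             # rough worst-case relative bound
ref = math.fsum(a * b for a, b in zip(x, y))

print("forward vs pairwise:", relative_error(dot_forward(x, y), dot_pairwise(x, y)))
print("rough bound k*eps:  ", bound)
```

A practical acceptance test is therefore to compare relative error against a tolerance of this order, rather than to count matching decimal places or to expect identical output.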