Why is CUBLAS much slower than MATLAB running on the CPU?

I wrote two MEX functions in MATLAB to test two CUBLAS functions: cublasCgemm and cublasCherk. The timing result is that both run much slower than the same computation in MATLAB on the CPU!
Even with a 2048×2048 complex matrix, the CUBLAS function is nearly 30 times slower than the CPU.

My GPU is an NVIDIA GeForce 9400, CUDA version 3.2.

MATLAB version is R2010b; the CPU is a 2.53 GHz Intel Core 2 Duo with 4 GB of 1067 MHz DDR3 memory.

What is the reason? I am puzzled.

a) try running the test with a regular single-precision (float) matrix (do not use double precision on a GeForce 9400)
b) the 9400 is a pretty low-end card. Its computational throughput is probably lower than that of your CPU, and if MATLAB uses Intel MKL (or another performance library) and makes use of SSE and multi-threading, it can beat the 9400
c) the 9400 has a low CPU<->GPU transfer rate, so you may be losing time transferring data to and from the device
d) it can be a bug in your testing code (e.g. you process much more data than you expect, or something like accidentally using sparse matrices, etc.)

A 30× slowdown still seems like a rather bad ratio, though, even for a 9400… Maybe the CUBLAS code is just not optimal for complex matrices, or maybe it is just not optimal for the 9400.

Thank you for your quick reply. All the reasons you gave are plausible.

Yes, I used single-precision matrices generated randomly like this:

A = single(rand(N,N) + i*rand(N,N));
B = single(rand(N,N) + i*rand(N,N));

So, no sparse matrices here.

Maybe I can try my code on a different GPU card and see how it performs.

30 times still sounds like too much… are you sure you built your CUDA app in Release mode (and not in device-emulation mode)?

Can you run NVIDIA's matrix multiplication example from the SDK and record the timing? Then try to mimic the same operation in MATLAB and compare the execution times.

I tried the SDK matrixMul example and did the same computation in MATLAB: the time on the GPU is 0.00004 s while on the CPU it is 0.0001 s, so the GPU gives a 2.5× speedup.

But that code does not use CUBLAS, and I would expect CUBLAS to be faster still.

Why do I get such bad results with my own code? I checked it several times and could not find the bug.

CUBLAS should not be slower than the SDK example, so there is probably an issue somewhere in your code. I can ask our developer to look at it; I imagine it is something really simple. Can you post it here?

Thank you so much. I am glad to post my code here and have someone help me find the error. I am new to CUDA, so please bear with me.

Here is the MEX file for cublasCgemm, which multiplies two complex matrices.

#include <stdio.h>
#include <stdlib.h>
#include "mex.h"
#include "cublas.h"

/* Copy a real MATLAB array into an interleaved cuComplex array. */
void pack_r2c(cuComplex *output_float, float *input_re, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_float[i].x = input_re[i];
        output_float[i].y = 0.0f;
    }
}

/* Interleave MATLAB's separate real/imaginary arrays into cuComplex. */
void pack_c2c(cuComplex *output_float, float *input_re, float *input_im, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_float[i].x = input_re[i];
        output_float[i].y = input_im[i];
    }
}

/* Split an interleaved cuComplex array back into real/imaginary arrays. */
void unpack_c2c(cuComplex *input_float, float *output_re, float *output_im, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_re[i] = input_float[i].x;
        output_im[i] = input_float[i].y;
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    cuComplex *h_A, *h_B, *h_C;              /* host memory   */
    cuComplex *d_A = 0, *d_B = 0, *d_C = 0;  /* device memory */
    int RowsA, RowsB, RowsC, ColsA, ColsB, ColsC;
    cuComplex alpha = {1, 0};
    cuComplex beta  = {0, 0};
    char transa, transb;
    int lda, ldb, ldc;
    float *ar, *ai;

    /* Check the numbers of input and output arguments. */
    if (nrhs != 2) {
        mexErrMsgTxt("FimageMul_CUDA requires 2 input arguments: matrices A and B");
    } else if (nlhs != 1) {
        mexErrMsgTxt("FimageMul_CUDA requires 1 output argument");
    }

    transa = 'n';
    transb = 'n';
    RowsA = mxGetM(prhs[0]);
    ColsA = mxGetN(prhs[0]);
    RowsB = mxGetM(prhs[1]);
    ColsB = mxGetN(prhs[1]);
    if (ColsA != RowsB) {
        mexErrMsgTxt("the second matrix's rows (height) must equal the first matrix's columns (width)");
        return;
    }
    RowsC = RowsA;
    ColsC = ColsB;

    /* Get the input matrix A: allocate a working array on the host. */
    h_A = (cuComplex *) mxCalloc(ColsA * RowsA, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[0]);   /* real part of A */
    if (mxIsComplex(prhs[0])) {
        ai = (float *) mxGetImagData(prhs[0]);
        pack_c2c(h_A, ar, ai, ColsA * RowsA);
    } else {
        pack_r2c(h_A, ar, ColsA * RowsA);
    }

    /* Get the input matrix B. */
    h_B = (cuComplex *) mxCalloc(ColsB * RowsB, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[1]);   /* real part of B */
    if (mxIsComplex(prhs[1])) {
        ai = (float *) mxGetImagData(prhs[1]);
        pack_c2c(h_B, ar, ai, ColsB * RowsB);
    } else {
        pack_r2c(h_B, ar, ColsB * RowsB);
    }

    lda = max(1, RowsA);   /* leading dimension of A */
    ldb = max(1, RowsB);   /* leading dimension of B */
    ldc = max(1, RowsA);   /* leading dimension of C */

    int mA_lda = RowsA, mA_ldb = RowsA;
    int mB_lda = RowsB, mB_ldb = RowsB;
    int mC_lda = RowsC, mC_ldb = RowsC;

    /* Host memory allocated for the result. */
    h_C = (cuComplex *) mxCalloc(ColsC * RowsC, sizeof(cuComplex));

    /* Initialize CUBLAS. */
    cublasInit();

    /* Allocate device memory for the matrices. */
    cublasAlloc(RowsA * ColsA, sizeof(cuComplex), (void **)&d_A);
    cublasAlloc(RowsB * ColsB, sizeof(cuComplex), (void **)&d_B);
    cublasAlloc(RowsC * ColsC, sizeof(cuComplex), (void **)&d_C);

    /* Initialize the device matrices with the host matrices. */
    cublasStatus cpA = cublasSetMatrix(RowsA, ColsA, sizeof(cuComplex), h_A, mA_lda, d_A, mA_ldb);
    cublasStatus cpB = cublasSetMatrix(RowsB, ColsB, sizeof(cuComplex), h_B, mB_lda, d_B, mB_ldb);
    cublasStatus cpC = cublasSetMatrix(RowsC, ColsC, sizeof(cuComplex), h_C, mC_lda, d_C, mC_ldb);

    cublasCgemm(transa, transb, RowsA, ColsB, ColsA, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
    cublasStatus cgRes = cublasGetError();
    if (cgRes == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasCgemm succeeds\n");
    else
        mexPrintf("!!!! cublasCgemm failed! (status = %x)\n", cgRes);

    /* Get the result matrix: copy from GPU to CPU. */
    cublasStatus cgm = cublasGetMatrix(RowsC, ColsC, sizeof(cuComplex), d_C, mC_lda, h_C, mC_lda);

    /* Set up the output. */
    const mwSize dims[] = {RowsC, ColsC};
    plhs[0] = mxCreateNumericArray(2, dims, mxSINGLE_CLASS, mxCOMPLEX);
    ar = (float *) mxGetPr(plhs[0]);
    ai = (float *) mxGetPi(plhs[0]);
    unpack_c2c(h_C, ar, ai, RowsC * ColsC);

    /* Memory clean-up. */
    mxFree(h_A);
    mxFree(h_B);
    mxFree(h_C);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);

    /* Shutdown. */
    cublasShutdown();
    return;
}

The test script is:

N = 32;
i = sqrt(-1);
A = single(rand(N,N) + i*rand(N,N));
B = single(rand(N,N) + i*rand(N,N));

disp('running on GPU');
tic
C = FimageMul_CUDA_float(A,B);
toc

disp('running on CPU');
tic
C_mat = A*B;
toc

One run gives:

running on GPU
cublasCgemm succeeds
Elapsed time is 0.000751 seconds.
running on CPU
Elapsed time is 0.000024 seconds.

Is there something wrong in my code?

My MEX file for cublasCherk:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    cuComplex *h_A, *h_C;          /* host memory   */
    cuComplex *d_A = 0, *d_C = 0;  /* device memory */
    int k, n;                      /* the size of A is k-by-n */
    float *ar, *ai;

    /* Check the numbers of input and output arguments. */
    if (nrhs != 1) {
        mexErrMsgTxt("requires 1 input argument: matrix A");
    } else if (nlhs != 1) {
        mexErrMsgTxt("requires 1 output argument");
    }

    k = mxGetM(prhs[0]);
    n = mxGetN(prhs[0]);

    /* Get the input matrix A: allocate a working array on the host. */
    h_A = (cuComplex *) mxCalloc(k * n, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[0]);   /* real part of A */
    if (mxIsComplex(prhs[0])) {
        ai = (float *) mxGetImagData(prhs[0]);
        pack_c2c(h_A, ar, ai, k * n);
    } else {
        pack_r2c(h_A, ar, k * n);
    }

    /* Host memory allocated for the result. */
    h_C = (cuComplex *) mxCalloc(n * n, sizeof(cuComplex));

    /* Initialize CUBLAS and test for errors. */
    cublasInit();
    cublasStatus retStatus = cublasGetError();
    if (retStatus != CUBLAS_STATUS_SUCCESS)
        mexPrintf("CUBLAS: an error occurred in cublasInit\n");
    else
        mexPrintf("CUBLAS: cublasInit worked\n");

    /* Allocate device memory for the matrices. */
    cublasAlloc(k * n, sizeof(cuComplex), (void **)&d_A);
    cublasAlloc(n * n, sizeof(cuComplex), (void **)&d_C);

    /* Initialize the device matrix with the host matrix. */
    cublasStatus cpA = cublasSetMatrix(k, n, sizeof(cuComplex), h_A, k, d_A, k);
    if (cpA == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasSetMatrix A succeeds\n");

    /* cublasCherk(char uplo, char trans, int n, int k, float alpha,
                   const cuComplex *A, int lda, float beta, cuComplex *C, int ldc) */
    cublasCherk('u', 'c', n, k, 1, d_A, k, 0, d_C, n);
    cublasStatus cgRes = cublasGetError();
    if (cgRes == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasCherk succeeds\n");
    else
        mexPrintf("!!!! cublasCherk failed! (status = %x)\n", cgRes);

    /* Get the result matrix: copy from GPU to CPU. */
    cublasStatus cgm = cublasGetMatrix(n, n, sizeof(cuComplex), d_C, n, h_C, n);

    const mwSize dims[] = {n, n};
    plhs[0] = mxCreateNumericArray(2, dims, mxSINGLE_CLASS, mxCOMPLEX);
    ar = (float *) mxGetPr(plhs[0]);
    ai = (float *) mxGetPi(plhs[0]);
    unpack_c2c(h_C, ar, ai, n * n);

    /* Memory clean-up. */
    mxFree(h_A);
    mxFree(h_C);
    cublasFree(d_A);
    cublasFree(d_C);

    /* Shutdown. */
    cublasShutdown();
    return;
}

Thank you for helping.

A big chunk of that 'CUBLAS' time is probably from printing 'cublasCgemm succeeds'.

I just did a quick MATLAB script on my fast PC (Intel i7):

tic
out = 'cublasCgemm succeeds'
toc

and got

Elapsed time is 0.000311 seconds.

Do you mean that I should delete all the output lines in the script?

I tried that, but there was no improvement.

Still got:

running on GPU
Elapsed time is 0.000897 seconds.
running on CPU
Elapsed time is 0.000024 seconds.

Here's what we concluded after looking at your issue with our developer:

(a) Right now the time for cublasInit() and cublasShutdown() is included in the computation time. Setting up the CUDA context for the process can take a long time, so try to measure exactly the piece that does the matrix multiplication (nothing else):

tic()
cublasCgemm(transa, transb, RowsA, ColsB, ColsA, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
toc()

(c) In general we perform all computations from a C++ application, so we don't have the MATLAB-related overhead, which could be why you don't get acceleration using CUDA from MATLAB while we do :-)

Sorry we couldn't be more helpful.

Thank you for your reply!

So, you mean that the slowness does not come from a bug in my code but from the CUBLAS library itself.

I will check whether the result is more reasonable on a better GPU.

Thank you again.

Hi

the slowness is not because CUBLAS is slow, but because your "GPU" time includes all kinds of data initialisation that the CPU code does not have.

So my guess is that almost all the time you are measuring comes from:

(i) CUBLAS initialisation (i.e. getting a GPU context)

(ii) data allocation on the device

(iii) data transfer from host to device

(iv) data transfer of the results from device to host

(v) CUBLAS shutdown

None of that is included in the CPU time, where you only measure "A*B". Furthermore, this overhead grows sublinearly with the number of matrix elements: things like CUBLAS initialisation and shutdown do not depend on the matrix size at all, and allocation is a single operation; only the data transfer grows with the number of elements (N*N). That means that for a larger matrix the overhead takes proportionally less time. With a 32*32 matrix you are nowhere near the region where the multiplication takes longer than all that initialisation; in fact, a 32*32 multiplication is probably not even close to fully utilising the GPU during the multiplication itself.

Generally the way to use GPUs (including GPUs in MATLAB) is to upload your input data once, then run a long sequence of computations, and download only the final results. A single matrix multiplication is most likely never enough to justify the cost of initialising the GPU and copying data to it. On the other hand, I doubt your actual goal is to accelerate a single matrix multiplication.

I suggest testing the following:

  1. Measure only the time for the matrix multiplication itself, so you know the maximum speedup available once the workload is large enough that GPU initialisation time is comparatively small.

  2. Try larger N (256, 512, 1024, 2048, …).

  3. Do more matrix multiplications (for a test, loop over the same multiplication a hundred times).

Cheers

Ceearem

Thank you for your detailed analysis; I agree with you about the cause of the slowness.

I will measure only the multiplication time next.

Have you tried Jacket? You might be interested in the Jacket SDK too: http://blog.accelereyes.com/blog/2010/10/29/jacket_sdk_trumps_mex/

I have no idea about Jacket. What is it? Is it a more convenient way to use CUDA in MATLAB? And where can I get it? Thank you!