Why is CUBLAS much slower than MATLAB running on the CPU?

I wrote two MEX functions in MATLAB to test two CUBLAS functions: cublasCgemm and cublasCherk. The timing result is that both run much slower than the same computation in MATLAB on the CPU!
Even with a 2048×2048 complex matrix, the CUBLAS function is nearly 30 times slower than the CPU.

My GPU is an NVIDIA GeForce 9400, CUDA version 3.2.

MATLAB version is R2010b; the CPU is a 2.53 GHz Intel Core 2 Duo with 4 GB of 1067 MHz DDR3 memory.

What is the reason? I am puzzled.

a) try running the test with a regular single-precision (float) matrix (do not use double precision on a GeForce 9400)
b) the 9400 is a pretty low-end card. Its computational throughput is probably lower than that of your CPU, and if MATLAB uses Intel MKL (or another performance library) and makes use of SSE and multi-threading, it can beat the 9400
c) the 9400 has a low CPU<->GPU transfer rate, so you may be losing time transferring data to and from the device
d) it can be a bug in your testing code (e.g. you process much more data than you expect, or something like accidentally using sparse matrices, etc.)

A 30× slowdown still seems like a rather bad ratio, though, even for a 9400… Maybe the CUBLAS code is just not optimal for complex matrices, or maybe it is just not optimal for the 9400.

Thank you for your quick reply. All the reasons you gave are plausible.

Yes, I used single-precision matrices generated randomly like this:

A = single(rand(N,N) + i*rand(N,N));
B = single(rand(N,N) + i*rand(N,N));

So, no sparse matrices here.

Maybe I can try my code on a different GPU card and see how it performs.

30 times still sounds like too much… are you sure you built your CUDA app in Release mode (and not in device-emulation mode)?

Can you run NVIDIA's matrix multiplication example from the SDK and record the timing? Then try to mimic the same operation in MATLAB and compare the execution times.

I tried the SDK matrixMul example and did the same computation in MATLAB: the time on the GPU is 0.00004 s while on the CPU it is 0.0001 s, so the GPU gives a 2.5× speedup.

But that code does not use CUBLAS, and I would expect CUBLAS to be faster still.

Why do I get such bad results with my own code? I checked it several times and could not find the bug.

CUBLAS should not be slower than the SDK example, so there is probably an issue somewhere in your code. I can ask our developer to look at it; I imagine it is something really simple. Can you post it here?

Thank you so much. I am glad to post my code here and have someone help me find the error. I am new to CUDA, so please bear with me.

Here is the MEX file for cublasCgemm, which multiplies two complex matrices.

#include <stdio.h>
#include <stdlib.h>
#include "mex.h"
#include "cublas.h"

/* Copy a real MATLAB array into an interleaved cuComplex array. */
void pack_r2c(cuComplex *output_float, float *input_re, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_float[i].x = input_re[i];
        output_float[i].y = 0.0f;
    }
}

/* Interleave MATLAB's separate real/imaginary arrays into cuComplex. */
void pack_c2c(cuComplex *output_float, float *input_re, float *input_im, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_float[i].x = input_re[i];
        output_float[i].y = input_im[i];
    }
}

/* Split an interleaved cuComplex array back into real/imaginary arrays. */
void unpack_c2c(cuComplex *input_float, float *output_re, float *output_im, int Ntot)
{
    int i;
    for (i = 0; i < Ntot; i++) {
        output_re[i] = input_float[i].x;
        output_im[i] = input_float[i].y;
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    cuComplex *h_A, *h_B, *h_C;              /* host memory   */
    cuComplex *d_A = 0, *d_B = 0, *d_C = 0;  /* device memory */
    int RowsA, RowsB, RowsC, ColsA, ColsB, ColsC;
    cuComplex alpha = {1, 0};
    cuComplex beta  = {0, 0};
    char transa, transb;
    int lda, ldb, ldc;
    float *ar, *ai;

    /* Check the numbers of input and output arguments. */
    if (nrhs != 2) {
        mexErrMsgTxt("FimageMul_CUDA requires 2 input arguments: matrices A and B");
    } else if (nlhs != 1) {
        mexErrMsgTxt("FimageMul_CUDA requires 1 output argument");
    }

    transa = 'n';
    transb = 'n';
    RowsA = mxGetM(prhs[0]);
    ColsA = mxGetN(prhs[0]);
    RowsB = mxGetM(prhs[1]);
    ColsB = mxGetN(prhs[1]);
    if (ColsA != RowsB) {
        mexErrMsgTxt("the second matrix's rows (height) must equal the first matrix's columns (width)");
        return;
    }
    RowsC = RowsA;
    ColsC = ColsB;

    /* Get the input matrix A: allocate a working array on the host. */
    h_A = (cuComplex *) mxCalloc(ColsA * RowsA, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[0]);   /* real part of A */
    if (mxIsComplex(prhs[0])) {
        ai = (float *) mxGetImagData(prhs[0]);
        pack_c2c(h_A, ar, ai, ColsA * RowsA);
    } else {
        pack_r2c(h_A, ar, ColsA * RowsA);
    }

    /* Get the input matrix B. */
    h_B = (cuComplex *) mxCalloc(ColsB * RowsB, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[1]);   /* real part of B */
    if (mxIsComplex(prhs[1])) {
        ai = (float *) mxGetImagData(prhs[1]);
        pack_c2c(h_B, ar, ai, ColsB * RowsB);
    } else {
        pack_r2c(h_B, ar, ColsB * RowsB);
    }

    lda = max(1, RowsA);   /* leading dimension of A */
    ldb = max(1, RowsB);   /* leading dimension of B */
    ldc = max(1, RowsA);   /* leading dimension of C */

    int mA_lda = RowsA, mA_ldb = RowsA;
    int mB_lda = RowsB, mB_ldb = RowsB;
    int mC_lda = RowsC, mC_ldb = RowsC;

    /* Host memory allocated for the result. */
    h_C = (cuComplex *) mxCalloc(ColsC * RowsC, sizeof(cuComplex));

    /* Initialize CUBLAS. */
    cublasInit();

    /* Allocate device memory for the matrices. */
    cublasAlloc(RowsA * ColsA, sizeof(cuComplex), (void **)&d_A);
    cublasAlloc(RowsB * ColsB, sizeof(cuComplex), (void **)&d_B);
    cublasAlloc(RowsC * ColsC, sizeof(cuComplex), (void **)&d_C);

    /* Initialize the device matrices with the host matrices. */
    cublasStatus cpA = cublasSetMatrix(RowsA, ColsA, sizeof(cuComplex), h_A, mA_lda, d_A, mA_ldb);
    cublasStatus cpB = cublasSetMatrix(RowsB, ColsB, sizeof(cuComplex), h_B, mB_lda, d_B, mB_ldb);
    cublasStatus cpC = cublasSetMatrix(RowsC, ColsC, sizeof(cuComplex), h_C, mC_lda, d_C, mC_ldb);

    cublasCgemm(transa, transb, RowsA, ColsB, ColsA, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
    cublasStatus cgRes = cublasGetError();
    if (cgRes == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasCgemm succeeds\n");
    else
        mexPrintf("!!!! cublasCgemm failed! (status = %x)\n", cgRes);

    /* Get the result matrix: copy from GPU to CPU. */
    cublasStatus cgm = cublasGetMatrix(RowsC, ColsC, sizeof(cuComplex), d_C, mC_lda, h_C, mC_lda);

    /* Set up the output. */
    const mwSize dims[] = {RowsC, ColsC};
    plhs[0] = mxCreateNumericArray(2, dims, mxSINGLE_CLASS, mxCOMPLEX);
    ar = (float *) mxGetPr(plhs[0]);
    ai = (float *) mxGetPi(plhs[0]);
    unpack_c2c(h_C, ar, ai, RowsC * ColsC);

    /* Memory clean-up. */
    mxFree(h_A);
    mxFree(h_B);
    mxFree(h_C);
    cublasFree(d_A);
    cublasFree(d_B);
    cublasFree(d_C);

    /* Shutdown. */
    cublasShutdown();
    return;
}

The test script is:

N = 32;
i = sqrt(-1);
A = single(rand(N,N) + i*rand(N,N));
B = single(rand(N,N) + i*rand(N,N));

disp('running on GPU');
tic
C = FimageMul_CUDA_float(A,B);
toc

disp('running on CPU');
tic
C_mat = A*B;
toc

One run gives:

running on GPU
cublasCgemm succeeds
Elapsed time is 0.000751 seconds.
running on CPU
Elapsed time is 0.000024 seconds.

Is there something wrong in my code?

My MEX file for cublasCherk:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    cuComplex *h_A, *h_C;          /* host memory   */
    cuComplex *d_A = 0, *d_C = 0;  /* device memory */
    int k, n;                      /* the size of A is k-by-n */
    float *ar, *ai;

    /* Check the numbers of input and output arguments. */
    if (nrhs != 1) {
        mexErrMsgTxt("requires 1 input argument: matrix A");
    } else if (nlhs != 1) {
        mexErrMsgTxt("requires 1 output argument");
    }

    k = mxGetM(prhs[0]);
    n = mxGetN(prhs[0]);

    /* Get the input matrix A: allocate a working array on the host. */
    h_A = (cuComplex *) mxCalloc(k * n, sizeof(cuComplex));
    ar  = (float *) mxGetData(prhs[0]);   /* real part of A */
    if (mxIsComplex(prhs[0])) {
        ai = (float *) mxGetImagData(prhs[0]);
        pack_c2c(h_A, ar, ai, k * n);
    } else {
        pack_r2c(h_A, ar, k * n);
    }

    /* Host memory allocated for the result. */
    h_C = (cuComplex *) mxCalloc(n * n, sizeof(cuComplex));

    /* Initialize CUBLAS and test for errors. */
    cublasInit();
    cublasStatus retStatus = cublasGetError();
    if (retStatus != CUBLAS_STATUS_SUCCESS)
        mexPrintf("CUBLAS: an error occurred in cublasInit\n");
    else
        mexPrintf("CUBLAS: cublasInit worked\n");

    /* Allocate device memory for the matrices. */
    cublasAlloc(k * n, sizeof(cuComplex), (void **)&d_A);
    cublasAlloc(n * n, sizeof(cuComplex), (void **)&d_C);

    /* Initialize the device matrix with the host matrix. */
    cublasStatus cpA = cublasSetMatrix(k, n, sizeof(cuComplex), h_A, k, d_A, k);
    if (cpA == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasSetMatrix A succeeds\n");

    /* cublasCherk(char uplo, char trans, int n, int k, float alpha,
                   const cuComplex *A, int lda, float beta, cuComplex *C, int ldc) */
    cublasCherk('u', 'c', n, k, 1, d_A, k, 0, d_C, n);
    cublasStatus cgRes = cublasGetError();
    if (cgRes == CUBLAS_STATUS_SUCCESS)
        mexPrintf("cublasCherk succeeds\n");
    else
        mexPrintf("!!!! cublasCherk failed! (status = %x)\n", cgRes);

    /* Get the result matrix: copy from GPU to CPU. */
    cublasStatus cgm = cublasGetMatrix(n, n, sizeof(cuComplex), d_C, n, h_C, n);

    const mwSize dims[] = {n, n};
    plhs[0] = mxCreateNumericArray(2, dims, mxSINGLE_CLASS, mxCOMPLEX);
    ar = (float *) mxGetPr(plhs[0]);
    ai = (float *) mxGetPi(plhs[0]);
    unpack_c2c(h_C, ar, ai, n * n);

    /* Memory clean-up. */
    mxFree(h_A);
    mxFree(h_C);
    cublasFree(d_A);
    cublasFree(d_C);

    /* Shutdown. */
    cublasShutdown();
    return;
}

Thank you for helping.

A big chunk of that 'CUBLAS' time is probably from printing 'cublasCgemm succeeds'.

I just did a quick MATLAB script on my fast PC (Intel i7):

tic
out = 'cublasCgemm succeeds'
toc

and got

Elapsed time is 0.000311 seconds.

Do you mean that I should delete all the output lines in the script?

I tried that, but there was no improvement.

Still got:

running on GPU
Elapsed time is 0.000897 seconds.
running on CPU
Elapsed time is 0.000024 seconds.

Here's what we concluded after looking at your issue with our developer:

(a) Right now the time for cublasInit() and cublasShutdown() is included in the computation time. Setting up the CUDA context for the process can take a long time, so try to measure exactly the piece that does the matrix multiplication (nothing else):

tic()
cublasCgemm(transa, transb, RowsA, ColsB, ColsA, alpha, d_A, lda, d_B, ldb, beta, d_C, ldc);
toc()

(c) In general we perform all computations from a C++ application, so we don't have the MATLAB-related overhead, which could be why you don't get acceleration using CUDA from MATLAB while we do :-)

Sorry we couldn't be more helpful.

Thank you for your reply!

So, you mean that the slowness does not come from a bug in my code but from the CUBLAS library itself.

I will check whether the result is more reasonable on a better GPU.

Thank you again.

Hi

the slowness is not because CUBLAS is slow, but because your "GPU" time includes all kinds of data initialisation that the CPU code does not have.

So my guess is that almost all the time you are measuring comes from:

(i) CUBLAS initialisation (i.e. getting a GPU context)

(ii) data allocation on the device

(iii) data transfer from host to device

(iv) data transfer of the results from device to host

(v) CUBLAS shutdown

None of that is included in the CPU time, where you only measure "A*B". Furthermore, this overhead grows sublinearly with the number of matrix elements: things like CUBLAS initialisation and shutdown do not depend on the matrix size at all, and allocation is a single operation; only the data transfer grows with the number of elements (N*N). That means that for a larger matrix the overhead takes proportionally less time. With a 32*32 matrix you are nowhere near the region where the multiplication takes longer than all that initialisation; in fact, a 32*32 multiplication is probably not even close to fully utilising the GPU during the multiplication itself.

Generally the way to use GPUs (including GPUs in MATLAB) is to upload your input data once, then run a long sequence of computations, and download only the final results. A single matrix multiplication is most likely never enough to justify the cost of initialising the GPU and copying data to it. On the other hand, I doubt your actual goal is to accelerate a single matrix multiplication.

I suggest testing the following:

  1. Measure only the time for the matrix multiplication itself, so you know the maximum speedup available once the workload is large enough that GPU initialisation time is comparatively small.

  2. Try larger N (256, 512, 1024, 2048, …).

  3. Do more matrix multiplications (for a test, loop over the same multiplication a hundred times).

Cheers

Ceearem

Thank you for your detailed analysis; I agree with you about the cause of the slowness.

I will measure only the multiplication time next.

Have you tried Jacket? You might be interested in the Jacket SDK too: http://blog.accelereyes.com/blog/2010/10/29/jacket_sdk_trumps_mex/

I have no idea about Jacket. What is it? Is it a more convenient way to use CUDA in MATLAB? And where can I get it? Thank you!