Matlab mex file using cublas - problems

I am getting my feet wet with Cuda in attempting to write a mex file that will use the cublasSgemm routine to multiply two matrices. I have Suse linux 10.3 and Cuda 1.1, manipulated to work on the system (all examples compile and work, the Bretherton’s matlab fluid/fft example compiles and works). I don’t really know quite what I am doing; I stumble around quite a bit, but it would be nice to get the routine below working. It compiles o.k., and runs in matlab ok without crashing (a miracle), but it reports back the wrong answer - I suspect array dimensioning problems. If called from matlab as C=sgemm_cu(A,B), the answer for C comes back with the correct dimensions but filled with elements of B, and the A array has arbitrary elements (it’s supposed to be unchanged). I modified the matlab-example Makefile to work for this routine. I am testing it on small 3X3 matrices, but the intent is to use it on 4000X4000 sized matrices or so. I am stuck - and so ask for help.

I note that a script like this would be a simpler (perhaps more useful?) matlab-mex example than the impressive fluid mechanics example.

Here is the code; I’d be thrilled to receive hints that would get it working!

ERRONEOUS CODE DELETED...

A bit of progress… I believe I needed to allocate the single precision a,b,c arrays. The latter part of the code perhaps should read:

MORE ERRONEOUS CODE DELETED

Now A returns unchanged as it should, but the answer for C is all zeros. hurmrumf…

Ahhhhh…success! I google-stumbled across some important clues. The code below works! Yay!

Very illuminating.

#include "mex.h"

#include "cublas.h"

/* Some routines for converting between single and double precision. */

void convert_double2float( double *input_double, float *output_float,int Ntot)

{

    int i;

    for (i = 0; i < Ntot; i++)

    {

                output_float[i] = (float) input_double[i];

    }

}

void convert_float2double( float *input_float, double *output_double,int Ntot)

{

    int i;

    for (i = 0; i < Ntot; i++)

    {

    /*  printf("input=%f\n",input_float[i]);   */

               output_double[i] = (double) input_float[i];

    /*  printf("ouput=%f\n",output_double[i]);  */

    }

}

/*  sgemm_cu.cu - Gateway function for subroutine sgemm

  function [C]=sgemm_cu(A,B)

*/

void mexFunction( int nlhs, mxArray *plhs[],

    int nrhs, const mxArray *prhs[])

{

      int M,K,L,N;

      int dims0[2];

      float alpha,beta;

      double *A,*B,*C;

      float *a,*b,*c;

      float *ga,*gb,*gc;

      cublasStatus retStatus;

     if (nrhs != 2) {

          mexErrMsgTxt("sgemm requires 2 input arguments");

      } else if (nlhs != 1) {

          mexErrMsgTxt("sgemm requires 1 output argument");

      }

      

      alpha = 1.0;

      beta = 0.0;

   

  /*    printf("alpha=%f\n",alpha); 

      printf("beta=%f\n",beta);  */

     M = mxGetM(prhs[0]);   /* gets number of rows of A */

      K = mxGetN(prhs[0]);   /* gets number of columns of A */

      L = mxGetM(prhs[1]);   /* gets number of rows of B */

      N = mxGetN(prhs[1]);   /* gets number of columns of B */

      /* K and L should be the same... */

/* Left hand side matrix set up */

      dims0[0]=M;

      dims0[1]=N;

      plhs[0] = mxCreateNumericArray(2,dims0,mxDOUBLE_CLASS,mxREAL);

      C = mxGetPr(plhs[0]);

      

/* Matrix 1 */

      A = mxGetPr(prhs[0]);

/* Matrix 2 */

      B = mxGetPr(prhs[1]);

/* Allocating working array on host */

      a  = (float*) mxMalloc(sizeof(float)*M*K);

      b  = (float*) mxMalloc(sizeof(float)*N*K);

     convert_double2float(A,a,M*K);

      convert_double2float(B,b,N*K);

     c  = (float*) mxMalloc(sizeof(float)*M*N);

 /* STARTUP   CUBLAS */

      retStatus = cublasInit();

     // test for error

     retStatus = cublasGetError ();

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

        printf("CUBLAS: an error occured in cublasInit\n");

      } else {

        printf("CUBLAS: cublasInit worked\n");

      }

  /* ALLOCATE SPACE ON THE GPU AND COPY a INTO IT */

      cublasAlloc (M*K, sizeof(float), (void**)&ga);

      // test for error

      retStatus = cublasGetError ();

      if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occured in cublasAlloc\n");

      } else {

      printf("CUBLAS: cublasAlloc worked\n");

      }

      retStatus = cublasSetMatrix (M, K, sizeof(float),

               a, M, (void*)ga, M);

  /* SAME FOR B, C */

      cublasAlloc (N*K, sizeof(float), (void**)&gb);

      retStatus = cublasSetMatrix (K, N, sizeof(float),

               b, K, (void*)gb, K);

     cublasAlloc (N*M, sizeof(float), (void**)&gc);

      retStatus = cublasSetMatrix (M, N, sizeof(float),

               c, M, (void*)gc, M);

  /* READY TO CALL SGEMM */

      (void) cublasSgemm ('n','n',M,N,K,alpha,ga,M,gb,L,beta,gc,M); 

/* NOW COPY THE RESULTING gc ON THE GPU TO THE LOCAL c */

     retStatus = cublasGetMatrix (M, N, sizeof(float),  gc, M, c, M);

/* FREE UP GPU MEMORY AND SHUTDOWN (OPTIONAL?) */

      cublasFree (ga);

      cublasFree (gb);

      cublasFree (gc);

      cublasShutdown();  

     convert_float2double(c,C,M*N);

}

And just to finish up this thread, here is a small matlab script I used to test the code, and the results:

A=randn(3000,2000);

B=randn(2000,3000);

disp('Matlab:')

tic

C1=A*B;

toc

tic

A1=single(A);

B1=single(B);

C1s=A1*B1;

toc

disp('Cuda:')

tic

C2=sgemm_cu(A,B);

toc

RESULTS

Matlab:

Elapsed time is 1.990617 seconds.

Elapsed time is 1.014878 seconds. <= in single precision, matlab is 2X faster!

Cuda:

CUBLAS: cublasInit worked

CUBLAS: cublasAlloc worked

Elapsed time is 0.654078 seconds. <= a modest improvement in computation time, in this case.

I am using an AMD Phenom 9600 cpu and an 8800GT 512MB, for the record.

One more addition to this thread… Below is a mex program for calling generalized sgemm. It can be called from matlab with: C=sgemm_cu(transa,transb,alpha,beta,A,B,C) and it will calculate C=alphaAB + beta*C, with A or B transposed according to transa or transb. I’ve chosen to use integer flags for these, so 0=no transpose, 1 (or anything else) = transpose.

At the moment, this mex file is offering me better than 20% improvement over ordinary matlab in my particular application which employs ca. 3000X3000 matrices. I need to go to greater dimensions soon, however so I expect this mex file will be well worthwhile - if single precision is sufficient for the calculation.

I post this, thinking at least some people will find it useful.

#include "mex.h"

#include "cublas.h"

void convert_double2float( double *input_double, float *output_float,int Ntot)

{

    int i;

    for (i = 0; i < Ntot; i++)

    {

                output_float[i] = (float) input_double[i];

    }

}

void convert_float2double( float *input_float, double *output_double,int Ntot)

{

    int i;

    for (i = 0; i < Ntot; i++)

    {

    /*  printf("input=%f\n",input_float[i]);   */

               output_double[i] = (double) input_float[i];

    /*  printf("ouput=%f\n",output_double[i]);  */

    }

}

/*  sgemm_cu.cu - Gateway function for subroutine sgemm

  function [C]=sgemm_cu(transa,transb,alpha,beta,A,B,C)

  transa,transb = 0/1 for no transpose/transpose of A,B

*/

void mexFunction( int nlhs, mxArray *plhs[],

    int nrhs, const mxArray *prhs[])

{

      cublasStatus status;

      int M,K,L,N,MM,NN,KK;

      int dims0[2];

      int ta,tb;

      float alpha,beta;

      double *A,*B,*C,*CC;

      float *a,*b,*c;

      float *ga,*gb,*gc;

      char transa,transb;

      cublasStatus retStatus;

     if (nrhs != 7) {

          mexErrMsgTxt("sgemm requires 7 input arguments");

      } else if (nlhs != 1) {

          mexErrMsgTxt("sgemm requires 1 output argument");

      }

     ta = (int) mxGetScalar(prhs[0]);

      tb = (int) mxGetScalar(prhs[1]);

      alpha = (float) mxGetScalar(prhs[2]);

      beta = (float) mxGetScalar(prhs[3]);

     M = mxGetM(prhs[4]);   /* gets number of rows of A */

      K = mxGetN(prhs[4]);   /* gets number of columns of A */

      L = mxGetM(prhs[5]);   /* gets number of rows of B */

      N = mxGetN(prhs[5]);   /* gets number of columns of B */

     if (ta == 0) {

          transa='n';

          MM=M;

          KK=K;

      } else {

          transa='t';

          MM=K;

          KK=M;

      }

     if (tb == 0) {

          transb='n';

          NN=N;

      } else {

          transb='t';

          NN=L;

      }

/*     printf("transa=%c\n",transa); 

      printf("transb=%c\n",transb); 

      printf("alpha=%f\n",alpha); 

      printf("beta=%f\n",beta);   */

/* Left hand side matrix set up */

      dims0[0]=MM;

      dims0[1]=NN;

      plhs[0] = mxCreateNumericArray(2,dims0,mxDOUBLE_CLASS,mxREAL);

      CC = mxGetPr(plhs[0]);

      

/* Matrix 1 */

      A = mxGetPr(prhs[4]);

/* Matrix 2 */

      B = mxGetPr(prhs[5]);

/* Matrix 3 */

      C = mxGetPr(prhs[6]);

/* Allocating working array on host */

      a  = (float*) mxMalloc(sizeof(float)*M*K);

      b  = (float*) mxMalloc(sizeof(float)*N*L);

      c  = (float*) mxMalloc(sizeof(float)*MM*NN);

     convert_double2float(A,a,M*K);

      convert_double2float(B,b,N*L);

      convert_double2float(C,c,MM*NN);

/* STARTUP   CUBLAS */

      retStatus = cublasInit();

     // test for error

     retStatus = cublasGetError ();

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

        printf("CUBLAS: an error occurred in cublasInit\n");

      } else {

        printf("");

      }

/* ALLOCATE SPACE ON THE GPU AND COPY a INTO IT */

      cublasAlloc (M*K, sizeof(float), (void**)&ga);

      // test for error

      retStatus = cublasGetError ();

      if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasAlloc\n");

      } else {

      printf("");

      }

      retStatus = cublasSetMatrix (M, K, sizeof(float),

               a, M, (void*)ga, M);

/* SAME FOR B, C */

      cublasAlloc (L*N, sizeof(float), (void**)&gb);

      retStatus = cublasSetMatrix (L, N, sizeof(float),

               b, L, (void*)gb, L);

     cublasAlloc (NN*MM, sizeof(float), (void**)&gc);

      retStatus = cublasSetMatrix (MM, NN, sizeof(float),

               c, MM, (void*)gc, MM);

 /*    printf("Op(A) has No. rows = %i\n",MM);

      printf("Op(B) has No. cols = %i\n",NN);

      printf("Op(A) has No. cols = %i\n",KK);

      printf("A has leading dimension = %i\n",M);

      printf("B has leading dimension = %i\n",L);

      printf("C has leading dimension = %i\n",MM); */

/* READY TO CALL SGEMM */

    (void) cublasSgemm (transa,transb,MM,NN,KK,alpha,ga,M,gb,L,beta,gc,MM); 

    status = cublasGetError();

    if (status != CUBLAS_STATUS_SUCCESS) {

        fprintf (stderr, "!!!! kernel execution error.\n");

    }

/* NOW COPY THE RESULTING gc ON THE GPU TO THE LOCAL c */

     retStatus = cublasGetMatrix (MM, NN, sizeof(float),

               gc, MM, c, MM);

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasGetMatrix\n");

     } else { 

     printf("");

     }

/* FREE UP GPU MEMORY AND SHUTDOWN (OPTIONAL?) */

      cublasFree (ga);

      cublasFree (gb);

      cublasFree (gc);

      cublasShutdown();  

     convert_float2double(c,CC,MM*NN);

}

Few comments on your code:

  1. check the input to see if it is double or single. If double you can apply your double2float and vice versa, if single you can pass the array directly to cuda.
    The conversion takes a fair amount of time
  2. try CUDA 2.0, SGEMM is using the code from Volkov that peaks at 205 Gflops on 8800 GTX or Tesla
  3. pad the array on the gpu to be a multiple of 32 or 64 (if using cuda 2.0) to get optimal performance

I am pretty much a newbie here, and stumbing around in C as well, I hesitate to take up your time…But presumably part of your mission here is to bring newbs like me along. So here goes…

Your comment about single/double got me to thinking that it would probably be best to design the code so that the user sends single precision arrays at the get-go. That way, everyone knows that single precision is what is happening. So, calling the routine would be like: C=sgemm_cu(transa,transb,single(alpha),single(beta),single(A),single(B),single©) - if A,B,C were double to begin with. I suspect matlab’s “single” function is considerably faster than the loop. (I ran into double/single array allocation problems with the mex file, however, now that I think about it; matlab seems to expect double precision in the transfer of data to the mex file. Maybe I just don’t know what I’m doing…)

I did see the notes about how CUDA 2.0 SGEMM was a little better, and I upgraded last night without any problems, apparently - I have Suse 10.3, but used the Redhat 5 packages without an issue.

Padding the arrays…I’ll have to think about how to do that. Presumably array sections padded with 0’s just give harmless blocks of 0’s in the answer C from cublasSgemm. Then when filling the array c with the answer, you’d only have to copy the non-padded portion of the array.

The “20% improvement” above was not quite fair - since I was timing a large loop of a bunch of calculations, and only one line (for now) used the mex function. The cublas computation time on that particular line was actually a little less than 2/3 of the matlab computation time. However, converting the same calculation to single precision in matlab, for an apples-to-apples comparision, found that CUDA gave only 20% improvement. These are matrices ca. 3000 in size. I need to call this routine quite a lot, so its worth tuning it up.

It is true, as mentioned elsewhere on this forum, that the first call of this mex function is lengthy, but subsequent calls are speedy.

Thanks for the reply!

Stumbling around some more, I’ve come up with the revised code listed below. It demands single precision input arrays at the start, and so does away with the double/single conversions, and (whoohoo!) runs about twice as fast as the equivalent matlab code. Note that the matlab code is running on a quadcore AMD cpu, and matlab is using openmp to employ all four cores in these calculations. So CUDA is giving me a factor of 2 over a quad-core calculation here. Now to work on padding the arrays…

And I’ve verified that, indeed, single precision works o.k. with my particular problem.

#include "mex.h"

#include "cublas.h"

/*  sgemm_cu.cu - Gateway function for subroutine sgemm

  function C = sgemm_cu(transa,transb,single(alpha),single(beta),single(A),single(B),single(C))

  transa,transb = 0/1 for no transpose/transpose of A,B

  Input arrays must be single precision.

*/

void mexFunction( int nlhs, mxArray *plhs[],

    int nrhs, const mxArray *prhs[])

{

      cublasStatus status;

      int M,K,L,N,MM,NN,KK;

      int dims0[2];

      int ta,tb;

      float alpha,beta;

      float *a,*b,*c,*cc;

      float *ga,*gb,*gc;

      char transa,transb;

      cublasStatus retStatus;

     if (nrhs != 7) {

          mexErrMsgTxt("sgemm requires 7 input arguments");

      } else if (nlhs != 1) {

          mexErrMsgTxt("sgemm requires 1 output argument");

      }

     if ( !mxIsSingle(prhs[4]) || 

           !mxIsSingle(prhs[5]) ||

           !mxIsSingle(prhs[6]))   {

           mexErrMsgTxt("Input arrays must be single precision.");

      }

     ta = (int) mxGetScalar(prhs[0]);

      tb = (int) mxGetScalar(prhs[1]);

      alpha = (float) mxGetScalar(prhs[2]);

      beta = (float) mxGetScalar(prhs[3]);

     M = mxGetM(prhs[4]);   /* gets number of rows of A */

      K = mxGetN(prhs[4]);   /* gets number of columns of A */

      L = mxGetM(prhs[5]);   /* gets number of rows of B */

      N = mxGetN(prhs[5]);   /* gets number of columns of B */

     if (ta == 0) {

          transa='n';

          MM=M;

          KK=K;

      } else {

          transa='t';

          MM=K;

          KK=M;

      }

     if (tb == 0) {

          transb='n';

          NN=N;

      } else {

          transb='t';

          NN=L;

      }

/*    printf("transa=%c\n",transa); 

      printf("transb=%c\n",transb); 

      printf("alpha=%f\n",alpha); 

      printf("beta=%f\n",beta);     */

/* Left hand side matrix set up */

      dims0[0]=MM;

      dims0[1]=NN;

      plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

      cc = (float*) mxGetData(plhs[0]);

      

/* Single-precision arrays */

/* Matrix 1 */

      a = (float*) mxGetData(prhs[4]);

/* Matrix 2 */

      b = (float*) mxGetData(prhs[5]);

/* Matrix 3 */

      c = (float*) mxGetData(prhs[6]);

/* STARTUP   CUBLAS */

      retStatus = cublasInit();

     // test for error

     retStatus = cublasGetError ();

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

        printf("CUBLAS: an error occurred in cublasInit\n");

      } 

/* ALLOCATE SPACE ON THE GPU AND COPY a INTO IT */

      cublasAlloc (M*K, sizeof(float), (void**)&ga);

      // test for error

      retStatus = cublasGetError ();

      if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasAlloc\n");

      } 

      retStatus = cublasSetMatrix (M, K, sizeof(float),

               a, M, (void*)ga, M);

/* SAME FOR B, C */

      cublasAlloc (L*N, sizeof(float), (void**)&gb);

      retStatus = cublasSetMatrix (L, N, sizeof(float),

               b, L, (void*)gb, L);

     cublasAlloc (NN*MM, sizeof(float), (void**)&gc);

      retStatus = cublasSetMatrix (MM, NN, sizeof(float),

               c, MM, (void*)gc, MM);

/*    printf("Op(A) has No. rows = %i\n",MM);

      printf("Op(B) has No. cols = %i\n",NN);

      printf("Op(A) has No. cols = %i\n",KK);

      printf("A has leading dimension = %i\n",M);

      printf("B has leading dimension = %i\n",L);

      printf("C has leading dimension = %i\n",MM); */

/* READY TO CALL SGEMM */

    (void) cublasSgemm (transa, transb, MM, NN, KK, alpha, 

                                ga, M, gb, L, beta, gc, MM); 

    status = cublasGetError();

    if (status != CUBLAS_STATUS_SUCCESS) {

        fprintf (stderr, "!!!! kernel execution error.\n");

    }

/* NOW COPY THE RESULTING gc ON THE GPU TO THE LOCAL c */

     retStatus = cublasGetMatrix (MM, NN, sizeof(float), gc, MM, cc, MM);

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasGetMatrix\n");

     } else { 

     printf("");

     }

/* FREE UP GPU MEMORY AND SHUTDOWN (OPTIONAL?) */

      cublasFree (ga);

      cublasFree (gb);

      cublasFree (gc);

      cublasShutdown();  

}

Stumbling around yet some more, below is code that might pad the arrays into multiples of 32. No guarantees…I really don’t know what I am doing…happy to hear about my mistakes or corrections…

After allocating the CUDA arrays, I fill them with zeros using Memset; I presume that’s the right thing to do to avoid memory garbage in the padded areas of the arrays.

Padded like this seems to give about 30% improvement in speed, so we’re really “cooking with gas” now!

#include "mex.h"

#include "cublas.h"

/*  sgemm_cu.cu - Gateway function for subroutine sgemm

  function C = sgemm_cu(transa,transb,single(alpha),single(beta),single(A),single(B),single(C))

  transa,transb = 0/1 for no transpose/transpose of A,B

  Input arrays must be single precision.

*/

void mexFunction( int nlhs, mxArray *plhs[],

    int nrhs, const mxArray *prhs[])

{

      cublasStatus status;

      int M,K,L,N,MM,NN,KK;

      int Mc,Kc,Lc,Nc,MMc,NNc,KKc;

      int dims0[2];

      int ta,tb;

      float alpha,beta;

      float *a,*b,*c,*cc;

      float *ga,*gb,*gc;

      char transa,transb;

      cublasStatus retStatus;

     if (nrhs != 7) {

          mexErrMsgTxt("sgemm requires 7 input arguments");

      } else if (nlhs != 1) {

          mexErrMsgTxt("sgemm requires 1 output argument");

      }

     if ( !mxIsSingle(prhs[4]) || 

           !mxIsSingle(prhs[5]) ||

           !mxIsSingle(prhs[6]))   {

           mexErrMsgTxt("Input arrays must be single precision.");

      }

     ta = (int) mxGetScalar(prhs[0]);

      tb = (int) mxGetScalar(prhs[1]);

      alpha = (float) mxGetScalar(prhs[2]);

      beta = (float) mxGetScalar(prhs[3]);

     M = mxGetM(prhs[4]);   /* gets number of rows of A */

      K = mxGetN(prhs[4]);   /* gets number of columns of A */

      L = mxGetM(prhs[5]);   /* gets number of rows of B */

      N = mxGetN(prhs[5]);   /* gets number of columns of B */

     if (ta == 0) {

          transa='n';

          MM=M;

          KK=K;

      } else {

          transa='t';

          MM=K;

          KK=M;

      }

     if (tb == 0) {

          transb='n';

          NN=N;

      } else {

          transb='t';

          NN=L;

      }

/*    printf("transa=%c\n",transa); 

      printf("transb=%c\n",transb); 

      printf("alpha=%f\n",alpha); 

      printf("beta=%f\n",beta);     */

/* Left hand side matrix set up */

      dims0[0]=MM;

      dims0[1]=NN;

      plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

      cc = (float*) mxGetData(plhs[0]);

      

/* Single-precision arrays */

/* Matrix 1 */

      a = (float*) mxGetData(prhs[4]);

/* Matrix 2 */

      b = (float*) mxGetData(prhs[5]);

/* Matrix 3 */

      c = (float*) mxGetData(prhs[6]);

/* STARTUP   CUBLAS */

      retStatus = cublasInit();

     // test for error

     retStatus = cublasGetError ();

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

        printf("CUBLAS: an error occurred in cublasInit\n");

      } 

    Mc=M+32-M%32;

     Kc=K+32-K%32;

/* ALLOCATE SPACE ON THE GPU AND COPY a INTO IT */

      cublasAlloc (Mc*Kc, sizeof(float), (void**)&ga);

      // test for error

      retStatus = cublasGetError ();

      if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasAlloc\n");

      } 

      cudaMemset(ga,0,Mc*Kc*4);

      retStatus = cublasSetMatrix (M, K, sizeof(float),

               a, M, (void*)ga, Mc);

     Lc=L+32-L%32;

      Nc=N+32-N%32;

/* SAME FOR B, C */

      cublasAlloc (Lc*Nc, sizeof(float), (void**)&gb);

      cudaMemset(gb,0,Lc*Nc*4);

      retStatus = cublasSetMatrix (L, N, sizeof(float),

               b, L, (void*)gb, Lc);

     MMc=MM+32-MM%32;

      NNc=NN+32-NN%32;

      KKc=KK+32-KK%32;

      cublasAlloc (MMc*NNc, sizeof(float), (void**)&gc);

      if (beta != 0.0 ) {

         cudaMemset(gc,0,MMc*NNc*4);

         retStatus = cublasSetMatrix (MM, NN, sizeof(float),

                  c, MM, (void*)gc, MMc);

      }

/*  PADDED ARRAYS */

/*    printf("Op(A) has No. rows = %i\n",MMc);

      printf("Op(B) has No. cols = %i\n",NNc);

      printf("Op(A) has No. cols = %i\n",KKc);

      printf("A has leading dimension = %i\n",Mc);

      printf("B has leading dimension = %i\n",Lc);

      printf("C has leading dimension = %i\n",MMc); */

/* READY TO CALL SGEMM */

    (void) cublasSgemm (transa, transb, MMc, NNc, KKc, alpha, 

                                ga, Mc, gb, Lc, beta, gc, MMc); 

    status = cublasGetError();

    if (status != CUBLAS_STATUS_SUCCESS) {

        fprintf (stderr, "!!!! kernel execution error.\n");

    }

/* NOW COPY THE RESULTING gc ON THE GPU TO THE LOCAL c */

     retStatus = cublasGetMatrix (MM, NN, sizeof(float), gc, MMc, cc, MM);

     if (retStatus != CUBLAS_STATUS_SUCCESS) {

          printf("CUBLAS: an error occurred in cublasGetMatrix\n");

     }

/* FREE UP GPU MEMORY AND SHUTDOWN (OPTIONAL?) */

      cublasFree (ga);

      cublasFree (gb);

      cublasFree (gc);

      cublasShutdown();  

}

One additional optimization, if beta=0 you don’t need to send C to the GPU.

Thanks for that - I’ve edited the code above to implement that. It seems to be worth a few/several percent improvement in speed.

At the end of the day, literally and figuratively, this mex code is giving me about a factor of 2 speed up in my real-world application (which is a Kalman filter project, incidentally). Or a factor of 2 going to single precision, and another factor of 2 going to CUDA.

(What’s with the gray back ground on this forum? I can hardly read anything anymore!)

time for…

:icecream:

Thanks for the clean demo! A comment and a question:

  1. It seems that
    Mc = M + 32 - M%32;
    adds 32 to Mc whenever Mc is a multiple of 32. What about changing it to
    Mc = M + (M%32 != 0)*(32 - M%32); ??

  2. What kind of changes would be required in order to use it with complex data - mxCOMPLEX? I mean, how the complex data coming from MATLAB have to be copied to the GPU memory?

Thanks again

I tried modifying the last piece of code you posted to make two calls to cublasSgemm. Execution takes a longer amount of time, but the answer returned to MATLAB is the same as if the call only happened once.

[codebox]

for (int i = 0; i < 20; i++) {

cudaThreadSynchronize();

/* READY TO CALL SGEMM */

(void) cublasSgemm (transa, transb, MM, NN, KK, alpha, ga, M, gb, L, beta, gc, MM);

status = cublasGetError();

if (status != CUBLAS_STATUS_SUCCESS) {

   fprintf (stderr, "!!!! kernel execution error.\n");

}

}

[/codebox]

Is there something I’m missing?

Thanks!

One thing that might be worth noting is the fact that I pass the same variable for B and Z. I am effectively computing Z = transpose(A)Z + 0Z over and over. The result is as if I just did that computation once.

Is there something wrong with trying to do this computation on the GPU through Matlab?