Help with CUDA+Matlab acceleration

Hello,

I have a Matlab file where I call a computation function like this:

[iRy,s,t]=Compute(iRy,yrb,s);

And the compute function that is called looks like this:

function [mat,kAdd,t]=Compute(org,add,s);

lenAdd=size(add,2);

kAdd=s+lenAdd;

orgAvg=org/s;

ident=eye(lenAdd,lenAdd);

t=0;

if(lenAdd==1)

    mat=kAdd*(orgAvg-(orgAvg*add*add'*orgAvg)/(ident+add'*orgAvg*add));

else

    mat=kAdd*(orgAvg-orgAvg*add*pinv(ident+add'*orgAvg*add)*add'*orgAvg);

end;

Now all I would need is CUDA accelerated Compute function but as I’m new to CUDA programing I have great difficulties implementing it.

Could anyone give me some help with it?

I admit this is a bit of a shameless plug, but…have you tried seeing if AccelerEyes’ Jacket for Matlab gives you the speedup you want? I’m part of the team working on it. It’s a runtime engine that offloads Matlab computation onto CUDA-enabled GPUs. You’d need to “vectorize” your Matlab code to really get the best performance. For more: AccelerEyes.
-James

Sorry but what do you mean with shameless that it would be easy to implement it on GPU or hard?

And yes I tried using Jacket at the point:

mat=gsingle(kAdd*(orgAvg-(orgAvgaddadd’*orgAvg)/(ident+add’orgAvgadd)));

but got errors later on for no particular reason as mat should be correctly computed by Jacket.

I was mostly joking. “Shameless” because I’m part of the team building Jacket, so it’s self-promotion.

Since this is really particular to Jacket and not CUDA, how about we continue on the AccelerEyes Forums. Could you post more of your code there and I can try figure out what went wrong?

-James

No problem I will post on AccelerEyes Forum also but I would still need a native CUDA implementation of this function. Is anyone prepared to give me a hand?

I could help you, did you already write the mexfunction to get the data in C?

No I didn’t. I was looking at some options on how to extract .m file into c but no success for now.

If you have the realtime workshop (maybe you also need embedded coder, not sure) you can convert your m file into a c file including a mexfunction interface. Then you can replace all the c code with a call to cuda and keep the interface part.
If you have trouble making a matlab-c interface, I would advise using jacket though.

I will try to export M file to C code tomorrow and see where does it take me. I just hope that when I get this subfunction to run on the GPU that it will be able to make use of GPU architecture and actualy be faster than the CPU computation.

Also I was informed that “Pinv - will likely be a problem because Jacket (and CUDA) don’t yet support matrix inverse”

Is that true?

Yes, unfortunately, CUDA (and, hence, Jacket) do not support ‘pinv’ at this point. These functions will likely come online in the near future. Jacket will have ‘pinv’ as soon as CUDA gets it, which I’m guessing will happen when CUDA gets SVD, inv, and other eigenvalue/vector decomposition stuff.

Best,

John Melonakos

AccelerEyes

I have been looking at inverting matrices in cuda. But then it turned out, I did not really need it for the speed as the matrix inversion was negligible in time, so I actually gave up on it.
What I do in my CUDA mex-code (big chunk) is copy the matrix that needs inverting back to cpu (matlab) in single precision and calling matlab from my mex file to do the inversion. Then I copy the inverted matrix again back to GPU for use in the rest of my cuda code.

That’s sound positive but also after I analyzed Matlab profiler results the bit where the pinv is called mat=kAdd*(orgAvg-orgAvgaddpinv(ident+add’orgAvgadd)*add’*orgAvg); is used only once and takes very little fraction of the processing time so in the end I don’t really need to do it on the GPU.

All I would need to do on a GPU would be:

mat=kAdd*(orgAvg-(orgAvgaddadd’*orgAvg)/(ident+add’orgAvgadd));

As this is the bit where 90% of all computation is done.

So if I understand correctly I should first put this function in C code and then make it CUDA compatible and compile it using nvcc.

So could anyone give me a hint how to put that into C code (I tried to do it using Simulink but ended with big and wierd C code)? If anyone would be able to help me I can even send the whole code on someones email if you guys maybe have some other tools to convert .m files to C code.

The time where I used this trick, I just replaced all eml_T 's with float, etc, etc. Then I removed the fixed array-dimensions to have a readable pure C version to compare with.

Then I made a CUDA kernel from taking the calculating part, and removing the for loops.

Otherwise just take a look at the example code that comes with the matlab plugin. You can adjust the mexFunction interface of that one to handle your input and output arrays. The handy thing about matlab->c is that it already checks for the arguments you will give it.

One small question would be also what are the optimal numbers of blocks, threads etc. for 8800GT card?

If I understand correctly if a 8800GTX with 768 MB with 8 units needs minimal 96 (768/8) threads then the 8800GT with 512 MB and 7 processing units needs a minimal od 73 threads (512/7)?

And furthermore if a 8800GTX has a TILE_WIDTH of 16 and 256 threads then optimal for 8800GT is TILE_WIDTH 14 and so 196 threads?

Also similar is 4.096 thread blocks for 8800GTX and 3584 thread blocks for 8800GT.

So on 8800GTX each thread block performs 512 float loads from device memory for 256216=8.192 mul/add operations and on 8800GT it performs 384
float loads from device memory for 192214=5.376 mul/add operations.

Are those numbers correct?

optimal from a hardware point of view is throw as many blocks at it of 256 threads per block (or 512 threads per block for GT200 hardware)

Ok thanks for the answer.

The next question would be:

What would be the easiest way to do a Matrix Multiplication on a CUDA enabled hardware callable from Matlab? In programing manual there is an example on matrix multiplication in CUDA but not in Matlab.

So in order that I would be able to run that example from Programming manual in on a Matlab matrix I would have to alter it probably with gateway function? And also alter some other calls?

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

And then compile it with something like:

nvmex -f nvmexopts.bat matrixMul.cu -IC:\cuda\include -LC:\cuda\lib -lcudart

Or is there already a working cuda enabled matrix multiplication function for matlab?

If you want to do the multiplication out on the card, check out the CUBLAS library’s cublasSgemm() function. A typical way to call it on Matlab format matrices C=A*B stored in linear CUDA device memory is this:

       cublasSgemm('N', 'N',

                    dims_A[0], /* leading dimension */

                    dims_B[1], /* trailing   "      */

                    dims_A[1], /* inner      "      */

                    1, /* alpha */

                    d_A, dims_A[0],

                    d_B, dims_B[0],

                    0, /* beta */

                    d_C, dims_A[0]);

Nice I will try that.

Another thing I will have to figure is a Matrix division of 512x512 matrix with for example one integer. How would I be able to do that? (A=B/C)

Example:

A[1 2] [3 4] = B [10 20] [30 40]/ C 10

Is it even worth implementing this on a GPU or will the time copying the data to and from the device negate any speed gains?

I should point out that the B matrix(512x512) is fixed so it could be copied on the device only once. And the C is rising incrementally. Every time I finish and get A matrix (512x512) the C becomes bigger by 1 and I start Matrix division again. I repeat this procedure around 10.000 times. Any suggestions how to make the best use of CUDA for this?

I think it would by of. You could use the scal-function. Even though that is for vectors.

Well I am making some progress but very limited. :|

I tried running a matrix multiplication on 2 512*512 matrices with the following code from matlab (the code was posted here on the forum):

#include <stdio.h>

#include <stdlib.h>

#include "mex.h"

#include "cublas.h"

/* Rutine za pretvorbo med single in double natancnostjo */

void convert_double2float( double *input_double, float *output_float,int Ntot){

   int i;

   for (i = 0; i < Ntot; i++){

               output_float[i] = (float) input_double[i];

   }

}

void convert_float2double( float *input_float, double *output_double,int Ntot){

   int i;

   for (i = 0; i < Ntot; i++){

               output_double[i] = (double) input_float[i];

   }

}

/*  vhodna funkcija [C]=cublasMMul(A,B) */

void mexFunction( int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[]){

     int M,K,L,N;

     int dims0[2];

     float alpha,beta;

     double *A,*B,*C;

     float *a,*b,*c;

     float *ga,*gb,*gc;

     cublasStatus retStatus;

    if (nrhs != 2) {

         mexErrMsgTxt("rabimo 2 vhodna argmenta");

     } else if (nlhs != 1) {

         mexErrMsgTxt("rabimo 1 izhodni argument");

     }

     

     alpha = 1.0;

     beta = 0.0;

    M = mxGetM(prhs[0]);   /* dobimo stevilo vrstic A */

     K = mxGetN(prhs[0]);   /* dobimo stevilo stolpcev A */

     L = mxGetM(prhs[1]);   /* dobimo stevilo vrstic B */

     N = mxGetN(prhs[1]);   /* dobimo stevilo stolpcev B */

     /* K in L morata biti iste velikosti */

/* Keiranje matrike */

     dims0[0]=M;

     dims0[1]=N;

     plhs[0] = mxCreateNumericArray(2,dims0,mxDOUBLE_CLASS,mxREAL);

     C = mxGetPr(plhs[0]);

     

/* Matrika 1 */

     A = mxGetPr(prhs[0]);

/* Matrika 2 */

     B = mxGetPr(prhs[1]);

/* alociranje delovnega polja na gostitelju */

     a  = (float*) mxMalloc(sizeof(float)*M*K);

     b  = (float*) mxMalloc(sizeof(float)*N*K);

    convert_double2float(A,a,M*K);

     convert_double2float(B,b,N*K);

    c  = (float*) mxMalloc(sizeof(float)*M*N);

/* Zagon Cublas knjiznice */

     retStatus = cublasInit();

    // testiramo za napake

    retStatus = cublasGetError ();

    if (retStatus != CUBLAS_STATUS_SUCCESS) {

       printf("CUBLAS: napaka v cublasInit\n");

     } else {

       printf("CUBLAS: cublasInit deluje normalno\n");

     }

 /* Dodeljevanje prostora na GPU enoti in kopiranje podatkov na napravo */

     cublasAlloc (M*K, sizeof(float), (void**)&ga);

     cublasSetMatrix (M, K, sizeof(float),a, M, (void*)ga, M);

    cublasAlloc (N*K, sizeof(float), (void**)&gb);

     cublasSetMatrix (K, N, sizeof(float),b, K, (void*)gb, K);

    cublasAlloc (N*M, sizeof(float), (void**)&gc);

     cublasSetMatrix (M, N, sizeof(float),c, M, (void*)gc, M);

 /* Klic cublasSgemm knjiznice */

     (void) cublasSgemm ('n','n',M,N,K,alpha,ga,M,gb,L,beta,gc,M);

/* kopiranje rezultata na c */

    cublasGetMatrix (M, N, sizeof(float),  gc, M, c, M);

/* sprostimo pomnilnik in izklopimo cublas */

     cublasFree (ga);

     cublasFree (gb);

     cublasFree (gc);

     cublasShutdown();  

    convert_float2double(c,C,M*N);

}

But results are actually very bad. A CPU gives results six times faster than the GPU (probably because of the data transfer from and to the device.) Also I dont se any usage of parallelism in the above program. Is there any? Is cublasSgemm taking any advantage of multithreading? Any hints on optimizing this code?

And one more question regarding mexFunction( int nlhs, mxArray *plhs,int nrhs, const mxArray *prhs)

nlhs = number of expected mxArrays (Left Hand Side)

plhs = array of pointers to expected outputs

nrhs = number of inputs (Right Hand Side)

prhs = array of pointers to input data. The input data is read-only

Is there any limit on the numer or size of the nlhs, plhs, nrhs and prhs?

And finally did anyone try to implement the standard CUDA C example matrix multiplication to the Matlab environment?