Help with CUDA+Matlab acceleration

bizi10 · August 29, 2008, 9:02pm

Hello,

I have a Matlab file where I call a computation function like this:

[iRy,s,t]=Compute(iRy,yrb,s);

And the compute function that is called looks like this:

function [mat,kAdd,t]=Compute(org,add,s);

lenAdd=size(add,2);

kAdd=s+lenAdd;

orgAvg=org/s;

ident=eye(lenAdd,lenAdd);

t=0;

if(lenAdd==1)

    mat=kAdd*(orgAvg-(orgAvg*add*add'*orgAvg)/(ident+add'*orgAvg*add));

else

    mat=kAdd*(orgAvg-orgAvg*add*pinv(ident+add'*orgAvg*add)*add'*orgAvg);

end;

Now all I would need is CUDA accelerated Compute function but as I’m new to CUDA programing I have great difficulties implementing it.

Could anyone give me some help with it?

James_Malcolm1 · August 30, 2008, 3:51am

I admit this is a bit of a shameless plug, but…have you tried seeing if AccelerEyes’ Jacket for Matlab gives you the speedup you want? I’m part of the team working on it. It’s a runtime engine that offloads Matlab computation onto CUDA-enabled GPUs. You’d need to “vectorize” your Matlab code to really get the best performance. For more: AccelerEyes.
-James

bizi10 · August 30, 2008, 5:30am

Sorry but what do you mean with shameless that it would be easy to implement it on GPU or hard?

And yes I tried using Jacket at the point:

mat=gsingle(kAdd*(orgAvg-(orgAvgaddadd’*orgAvg)/(ident+add’orgAvgadd)));

but got errors later on for no particular reason as mat should be correctly computed by Jacket.

James_Malcolm1 · August 30, 2008, 7:44am

I was mostly joking. “Shameless” because I’m part of the team building Jacket, so it’s self-promotion.

Since this is really particular to Jacket and not CUDA, how about we continue on the AccelerEyes Forums. Could you post more of your code there and I can try figure out what went wrong?

-James

bizi10 · August 30, 2008, 8:43am

No problem I will post on AccelerEyes Forum also but I would still need a native CUDA implementation of this function. Is anyone prepared to give me a hand?

E.D_Riedijk · August 30, 2008, 11:14am

I could help you, did you already write the mexfunction to get the data in C?

bizi10 · August 30, 2008, 12:31pm

No I didn’t. I was looking at some options on how to extract .m file into c but no success for now.

E.D_Riedijk · August 30, 2008, 7:45pm

If you have the realtime workshop (maybe you also need embedded coder, not sure) you can convert your m file into a c file including a mexfunction interface. Then you can replace all the c code with a call to cuda and keep the interface part.
If you have trouble making a matlab-c interface, I would advise using jacket though.

bizi10 · August 30, 2008, 9:29pm

I will try to export M file to C code tomorrow and see where does it take me. I just hope that when I get this subfunction to run on the GPU that it will be able to make use of GPU architecture and actualy be faster than the CPU computation.

Also I was informed that “Pinv - will likely be a problem because Jacket (and CUDA) don’t yet support matrix inverse”

Is that true?

melonakos · August 31, 2008, 1:45am

Yes, unfortunately, CUDA (and, hence, Jacket) do not support ‘pinv’ at this point. These functions will likely come online in the near future. Jacket will have ‘pinv’ as soon as CUDA gets it, which I’m guessing will happen when CUDA gets SVD, inv, and other eigenvalue/vector decomposition stuff.

Best,

John Melonakos

AccelerEyes

E.D_Riedijk · August 31, 2008, 11:50am

I have been looking at inverting matrices in cuda. But then it turned out, I did not really need it for the speed as the matrix inversion was negligible in time, so I actually gave up on it.
What I do in my CUDA mex-code (big chunk) is copy the matrix that needs inverting back to cpu (matlab) in single precision and calling matlab from my mex file to do the inversion. Then I copy the inverted matrix again back to GPU for use in the rest of my cuda code.

bizi10 · August 31, 2008, 5:10pm

That’s sound positive but also after I analyzed Matlab profiler results the bit where the pinv is called mat=kAdd*(orgAvg-orgAvgaddpinv(ident+add’orgAvgadd)*add’*orgAvg); is used only once and takes very little fraction of the processing time so in the end I don’t really need to do it on the GPU.

All I would need to do on a GPU would be:

mat=kAdd*(orgAvg-(orgAvgaddadd’*orgAvg)/(ident+add’orgAvgadd));

As this is the bit where 90% of all computation is done.

So if I understand correctly I should first put this function in C code and then make it CUDA compatible and compile it using nvcc.

So could anyone give me a hint how to put that into C code (I tried to do it using Simulink but ended with big and wierd C code)? If anyone would be able to help me I can even send the whole code on someones email if you guys maybe have some other tools to convert .m files to C code.

E.D_Riedijk · August 31, 2008, 8:39pm

The time where I used this trick, I just replaced all eml_T 's with float, etc, etc. Then I removed the fixed array-dimensions to have a readable pure C version to compare with.

Then I made a CUDA kernel from taking the calculating part, and removing the for loops.

Otherwise just take a look at the example code that comes with the matlab plugin. You can adjust the mexFunction interface of that one to handle your input and output arrays. The handy thing about matlab->c is that it already checks for the arguments you will give it.

bizi10 · September 1, 2008, 7:11am

One small question would be also what are the optimal numbers of blocks, threads etc. for 8800GT card?

If I understand correctly if a 8800GTX with 768 MB with 8 units needs minimal 96 (768/8) threads then the 8800GT with 512 MB and 7 processing units needs a minimal od 73 threads (512/7)?

And furthermore if a 8800GTX has a TILE_WIDTH of 16 and 256 threads then optimal for 8800GT is TILE_WIDTH 14 and so 196 threads?

Also similar is 4.096 thread blocks for 8800GTX and 3584 thread blocks for 8800GT.

So on 8800GTX each thread block performs 512 float loads from device memory for 256216=8.192 mul/add operations and on 8800GT it performs 384
float loads from device memory for 192214=5.376 mul/add operations.

Are those numbers correct?

E.D_Riedijk · September 1, 2008, 12:54pm

optimal from a hardware point of view is throw as many blocks at it of 256 threads per block (or 512 threads per block for GT200 hardware)

bizi10 · September 2, 2008, 9:00am

Ok thanks for the answer.

The next question would be:

What would be the easiest way to do a Matrix Multiplication on a CUDA enabled hardware callable from Matlab? In programing manual there is an example on matrix multiplication in CUDA but not in Matlab.

So in order that I would be able to run that example from Programming manual in on a Matlab matrix I would have to alter it probably with gateway function? And also alter some other calls?

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

And then compile it with something like:

nvmex -f nvmexopts.bat matrixMul.cu -IC:\cuda\include -LC:\cuda\lib -lcudart

Or is there already a working cuda enabled matrix multiplication function for matlab?

James_Malcolm1 · September 2, 2008, 9:18am

If you want to do the multiplication out on the card, check out the CUBLAS library’s cublasSgemm() function. A typical way to call it on Matlab format matrices C=A*B stored in linear CUDA device memory is this:

       cublasSgemm('N', 'N',

                    dims_A[0], /* leading dimension */

                    dims_B[1], /* trailing   "      */

                    dims_A[1], /* inner      "      */

                    1, /* alpha */

                    d_A, dims_A[0],

                    d_B, dims_B[0],

                    0, /* beta */

                    d_C, dims_A[0]);

bizi10 · September 2, 2008, 9:06pm

If you want to do the multiplication out on the card, check out the CUBLAS library’s cublasSgemm() function. A typical way to call it on Matlab format matrices C=A*B stored in linear CUDA device memory is this:

ï¿½ Â ï¿½ Â ï¿½ Â Â cublasSgemm('N', 'N',

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â dims_A[0], /* leading dimension */

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â dims_B[1], /* trailing ï¿½ Â " ï¿½ Â ï¿½ Â Â */

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â dims_A[1], /* inner ï¿½ Â ï¿½ Â Â " ï¿½ Â ï¿½ Â Â */

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â 1, /* alpha */

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â d_A, dims_A[0],

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â d_B, dims_B[0],

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â 0, /* beta */

 ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â ï¿½ Â Â d_C, dims_A[0]);

[snapback]434313[/snapback]

Nice I will try that.

Another thing I will have to figure is a Matrix division of 512x512 matrix with for example one integer. How would I be able to do that? (A=B/C)

Example:

A[1 2] [3 4] = B [10 20] [30 40]/ C 10

Is it even worth implementing this on a GPU or will the time copying the data to and from the device negate any speed gains?

I should point out that the B matrix(512x512) is fixed so it could be copied on the device only once. And the C is rising incrementally. Every time I finish and get A matrix (512x512) the C becomes bigger by 1 and I start Matrix division again. I repeat this procedure around 10.000 times. Any suggestions how to make the best use of CUDA for this?

theMarix · September 3, 2008, 7:10am

I think it would by of. You could use the scal-function. Even though that is for vectors.

bizi10 · September 4, 2008, 9:54pm

Well I am making some progress but very limited. :|

I tried running a matrix multiplication on 2 512*512 matrices with the following code from matlab (the code was posted here on the forum):

#include <stdio.h>

#include <stdlib.h>

#include "mex.h"

#include "cublas.h"

/* Rutine za pretvorbo med single in double natancnostjo */

void convert_double2float( double *input_double, float *output_float,int Ntot){

   int i;

   for (i = 0; i < Ntot; i++){

               output_float[i] = (float) input_double[i];

   }

}

void convert_float2double( float *input_float, double *output_double,int Ntot){

   int i;

   for (i = 0; i < Ntot; i++){

               output_double[i] = (double) input_float[i];

   }

}

/*  vhodna funkcija [C]=cublasMMul(A,B) */

void mexFunction( int nlhs, mxArray *plhs[],int nrhs, const mxArray *prhs[]){

     int M,K,L,N;

     int dims0[2];

     float alpha,beta;

     double *A,*B,*C;

     float *a,*b,*c;

     float *ga,*gb,*gc;

     cublasStatus retStatus;

    if (nrhs != 2) {

         mexErrMsgTxt("rabimo 2 vhodna argmenta");

     } else if (nlhs != 1) {

         mexErrMsgTxt("rabimo 1 izhodni argument");

     }

     

     alpha = 1.0;

     beta = 0.0;

    M = mxGetM(prhs[0]);   /* dobimo stevilo vrstic A */

     K = mxGetN(prhs[0]);   /* dobimo stevilo stolpcev A */

     L = mxGetM(prhs[1]);   /* dobimo stevilo vrstic B */

     N = mxGetN(prhs[1]);   /* dobimo stevilo stolpcev B */

     /* K in L morata biti iste velikosti */

/* Keiranje matrike */

     dims0[0]=M;

     dims0[1]=N;

     plhs[0] = mxCreateNumericArray(2,dims0,mxDOUBLE_CLASS,mxREAL);

     C = mxGetPr(plhs[0]);

     

/* Matrika 1 */

     A = mxGetPr(prhs[0]);

/* Matrika 2 */

     B = mxGetPr(prhs[1]);

/* alociranje delovnega polja na gostitelju */

     a  = (float*) mxMalloc(sizeof(float)*M*K);

     b  = (float*) mxMalloc(sizeof(float)*N*K);

    convert_double2float(A,a,M*K);

     convert_double2float(B,b,N*K);

    c  = (float*) mxMalloc(sizeof(float)*M*N);

/* Zagon Cublas knjiznice */

     retStatus = cublasInit();

    // testiramo za napake

    retStatus = cublasGetError ();

    if (retStatus != CUBLAS_STATUS_SUCCESS) {

       printf("CUBLAS: napaka v cublasInit\n");

     } else {

       printf("CUBLAS: cublasInit deluje normalno\n");

     }

 /* Dodeljevanje prostora na GPU enoti in kopiranje podatkov na napravo */

     cublasAlloc (M*K, sizeof(float), (void**)&ga);

     cublasSetMatrix (M, K, sizeof(float),a, M, (void*)ga, M);

    cublasAlloc (N*K, sizeof(float), (void**)&gb);

     cublasSetMatrix (K, N, sizeof(float),b, K, (void*)gb, K);

    cublasAlloc (N*M, sizeof(float), (void**)&gc);

     cublasSetMatrix (M, N, sizeof(float),c, M, (void*)gc, M);

 /* Klic cublasSgemm knjiznice */

     (void) cublasSgemm ('n','n',M,N,K,alpha,ga,M,gb,L,beta,gc,M);

/* kopiranje rezultata na c */

    cublasGetMatrix (M, N, sizeof(float),  gc, M, c, M);

/* sprostimo pomnilnik in izklopimo cublas */

     cublasFree (ga);

     cublasFree (gb);

     cublasFree (gc);

     cublasShutdown();  

    convert_float2double(c,C,M*N);

}

But results are actually very bad. A CPU gives results six times faster than the GPU (probably because of the data transfer from and to the device.) Also I dont se any usage of parallelism in the above program. Is there any? Is cublasSgemm taking any advantage of multithreading? Any hints on optimizing this code?

And one more question regarding mexFunction( int nlhs, mxArray *plhs,int nrhs, const mxArray *prhs)

nlhs = number of expected mxArrays (Left Hand Side)

plhs = array of pointers to expected outputs

nrhs = number of inputs (Right Hand Side)

prhs = array of pointers to input data. The input data is read-only

Is there any limit on the numer or size of the nlhs, plhs, nrhs and prhs?

And finally did anyone try to implement the standard CUDA C example matrix multiplication to the Matlab environment?

Topic		Replies	Views
Why cublas is much slower than Matlab runs on CPU CUDA Programming and Performance	15	5016	February 10, 2011
Matlab mex file using cublas - problems CUDA Programming and Performance	13	8989	October 13, 2009
Tell me a way to GPU speed up with banal elementwise multiplication.. Thanks in advance! CUDA Programming and Performance	15	3636	August 5, 2017
Speed up Matlab code without changing it? CUDA Programming and Performance	2	3529	July 16, 2008
Help with CUBLAS performance and timing issues, please help... CUDA Programming and Performance	1	3454	December 26, 2008
Inverse of a 3x3 matrix CUDA Programming and Performance	22	9505	February 20, 2024
Sparse Matrix-Matrix Multiplication CUDA Programming and Performance	14	18200	June 15, 2010
CUBLAS matrix-vector multiplication CUDA Programming and Performance	14	10121	January 20, 2010
How can I improve this code which only reduces half time for the same code using MATLAB, thanks! CUDA Programming and Performance	5	1639	September 29, 2009
plain matrix multiplication very slow (implemented with Matlab mex) CUDA Programming and Performance	4	1678	August 1, 2016

Help with CUDA+Matlab acceleration

Related topics