Matlab, Mex files & CUDA

jasonroger · January 24, 2012, 2:39pm

Hello to everybody,

I'm a student, new to Matlab and CUDA. I have to write a simple MEX file that takes a vector as input and then calls a routine from a CUDA shared library (void timestwo(float *x, float *y, int n)), which simply multiplies each element of the vector by two.

This is the code of the mex file:

#include "mex.h"
#include "matrix.h"

/* Header file of the shared library */
#include "doppiolgcc.h"

void myExitFcn()
{
mexPrintf("MEX-file is being unloaded");
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs,
const mxArray *prhs[])
{
double *x, *y;
int i;
int mrows, ncols;

/* The input must be a noncomplex floating-point vector*/
mrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
    !(ncols == 1)) {
  mexErrMsgTxt("Input must be a noncomplex floating-point vector.");
}
  
/* Assign pointers to each input and output. */
x = mxGetPr(prhs[0]);
plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
y = mxGetPr(plhs[0]);
  
/*Call the external routine */
timestwo(x, y, mrows); 

if(mexAtExit(myExitFcn))
{
  mexPrintf("Error unloading function!");
}

}

The code of the header file is the following

extern "C" void timestwo(float *x, float *y, int LEN);

And this is the simple implementation in CUDA of such a procedure

const int N = 256;

__global__ void vecAdd(float* A, float* B)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
B[i] = A[i]*2.0;
}

extern "C" void timestwo(float *x, float *y, int len)
{
/* pointers to device memory */
float *x_d, *y_d;

/* Allocate arrays x_d, y_d on device*/
cudaMalloc ((void **) &x_d, sizeof(float)*len);
cudaMalloc ((void **) &y_d, sizeof(float)*len);

/* Copy data from host memory to device memory */
cudaMemcpy(x_d, x, sizeof(float)*len, cudaMemcpyHostToDevice);

/* Launch the computation*/ 
vecAdd<<< N/len, len>>>(x_d, y_d);

/* Copy data from device memory to host memory */
cudaMemcpy(y, y_d, sizeof(float)*len, cudaMemcpyDeviceToHost);

/* Free the memory */
cudaFree(x_d); 
cudaFree(y_d);

}

After that, it compiles successfully and I launch my application.
I initialize my input variable like this:

for i=1:256
a(i)=i;
a=a'
end

This is the final output:

b = doppiom(a, i)

b =

So I obtain b[i] = a[256]*2^i for a few i, instead of b[i] = a[i]*2 for all i. Why doesn't it work?

Thank you all very much!

Jason.

short · January 24, 2012, 4:15pm

Your kernel launch configuration doesn't seem right. The total number of threads (blocks * threads per block) should be equal to the total number of vector elements you want to operate on, as each thread computes one element of the vector in your case.
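
To make the arithmetic concrete, here is a minimal sketch in plain C (not the poster's code; `blocks_needed` and `timestwo_emulated` are hypothetical names). It computes the block count by ceiling division and then walks the same index formula as the kernel on the CPU, with a bounds guard for the surplus threads in the last block. Note that the posted `N/len` is an integer division, so it gives 0 blocks whenever len is larger than N.

```c
#include <assert.h>

/* Blocks needed so that blocks * threads_per_block >= n
   (integer ceiling division). */
static int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU stand-in for the kernel launch: every (block, thread) pair
   handles the single index i = thread + threads_per_block * block,
   which mirrors threadIdx.x + blockDim.x * blockIdx.x. */
static void timestwo_emulated(const float *in, float *out, int n,
                              int threads_per_block)
{
    int blocks = blocks_needed(n, threads_per_block);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = thread + threads_per_block * block;
            if (i < n) {  /* surplus threads in the last block do nothing */
                out[i] = in[i] * 2.0f;
            }
        }
    }
}
```

In the CUDA version the launch would then look something like `vecAdd<<<blocks_needed(len, threads), threads>>>(x_d, y_d, len)`, assuming the kernel is given `len` as an extra argument and checks `i < len` before writing. Separately, it may be worth checking the types: `mxGetPr` returns `double *` while `timestwo` takes `float *`, so the data probably needs converting to single precision before the call.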

jasonroger · February 24, 2012, 9:01am

Now it works, thank you very much for your kind answer, and I apologize for my belated reply!

Best,

Jason.

_constant · February 25, 2012, 8:52am

Yes, instead of writing your own MEX file (and actually learning something) maybe you should try to use Jacket?