Matlab, Mex files & CUDA

Hello everybody,

I'm a student and a newbie to MATLAB and CUDA. I have to write a simple MEX file that takes a vector as input and then calls a routine from a CUDA shared library (void timestwo(float *x, float *y, int n)), which simply multiplies each element of the vector by two.

This is the code of the mex file:

#include "mex.h"
#include "matrix.h"

/* Header file of the shared library */
#include "doppiolgcc.h"

void myExitFcn()
{
    mexPrintf("MEX-file is being unloaded");
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs,
                 const mxArray *prhs[])
{
    double *x, *y;
    int i;
    int mrows, ncols;

    /* The input must be a noncomplex floating-point vector */
    mrows = mxGetM(prhs[0]);
    ncols = mxGetN(prhs[0]);
    if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
        !(ncols == 1)) {
        mexErrMsgTxt("Input must be a noncomplex floating-point vector.");
    }

    /* Assign pointers to each input and output. */
    x = mxGetPr(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
    y = mxGetPr(plhs[0]);

    /* Call the external routine */
    timestwo(x, y, mrows);

    if (mexAtExit(myExitFcn))
    {
        mexPrintf("Error unloading function!");
    }
}

The code of the header file is the following

extern "C" void timestwo(float *x, float *y, int LEN);

And this is the simple implementation in CUDA of such a procedure

const int N = 256;

__global__ void vecAdd(float* A, float* B)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    B[i] = A[i]*2.0;
}

extern "C" void timestwo(float *x, float *y, int len)
{
    /* pointers to device memory */
    float *x_d, *y_d;

    /* Allocate arrays x_d, y_d on device */
    cudaMalloc((void **) &x_d, sizeof(float)*len);
    cudaMalloc((void **) &y_d, sizeof(float)*len);

    /* Copy data from host memory to device memory */
    cudaMemcpy(x_d, x, sizeof(float)*len, cudaMemcpyHostToDevice);

    /* Launch the computation */
    vecAdd<<< N/len, len>>>(x_d, y_d);

    /* Copy data from device memory to host memory */
    cudaMemcpy(y, y_d, sizeof(float)*len, cudaMemcpyDeviceToHost);

    /* Free the memory */
    cudaFree(x_d);
    cudaFree(y_d);
}

After doing so, I compile successfully and then launch my application.
I initialize my input variable like this:

for i=1:256
a(i)=i;
a=a'
end

This is the final output:

b = doppiom(a, i)

b =

       256
       512
       768
      1024
      1280
      1536
      1792
      2048
      2304
      2560
      2816
      3072
      3328
      3584
      3840
      4096
      4352
      4608
      4864
      5120
      5376
      5632
      5888
      6144
      6400
      6656
      6912
      7168
      7424
      7680
      7936
      8192
      8448
      8704
      8960
      9216
      9472
      9728
      9984
     10240
     10496
     10752
     11008
     11264
     11520
     11776
     12032
     12288
     12544
     12800
     13056
     13312
     13568
     13824
     14080
     14336
     14592
     14848
     15104
     15360
     15616
     15872
     16128
     16384
     16640
     16896
     17152
     17408
     17664
     17920
     18176
     18432
     18688
     18944
     19200
     19456
     19712
     19968
     20224
     20480
     20736
     20992
     21248
     21504
     21760
     22016
     22272
     22528
     22784
     23040
     23296
     23552
     23808
     24064
     24320
     24576
     24832
     25088
     25344
     25600
     25856
     26112
     26368
     26624
     26880
     27136
     27392
     27648
     27904
     28160
     28416
     28672
     28928
     29184
     29440
     29696
     29952
     30208
     30464
     30720
     30976
     31232
     31488
     31744
     32000
     32256
     32512
     32768
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0
         0

So I obtain b[i] = a[256]*i for only the first 128 entries (and 0 for the rest), instead of b[i] = a[i]*2 for all i. Why doesn't it work?

Thank you all very much!

Jason.
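
A side note on the posted code: mexFunction holds double pointers (mxGetPr returns double *), while timestwo is declared with float pointers. Reading the doubles 1, 2, 3 as pairs of floats, doubling them, and reading the result back as doubles gives 256, 512, 768 on a little-endian IEEE-754 machine, and 256 floats fill only the first 128 doubles, which would be consistent with the output shown above. A minimal sketch of one possible remedy, assuming the float interface of timestwo is kept and the data is converted on the host; the helper names doubles_to_floats and floats_to_doubles are hypothetical and not part of the MEX or CUDA APIs:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: narrow n doubles (MATLAB's default numeric type)
 * to floats so they can be passed to a float-based routine. */
void doubles_to_floats(const double *src, float *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (float)src[i];
    }
}

/* Hypothetical helper: widen n floats back to doubles so the result
 * can be stored in an array created with mxCreateDoubleMatrix. */
void floats_to_doubles(const float *src, double *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (double)src[i];
    }
}
```

Inside mexFunction, one would allocate two float buffers of mrows elements (for example with mxMalloc), call doubles_to_floats(x, xf, mrows), then timestwo(xf, yf, mrows), then floats_to_doubles(yf, y, mrows), and release the buffers with mxFree. Here xf and yf are assumed names for the temporary buffers.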

Your kernel launch configuration doesn't seem right. The total number of threads (blocks * threads per block) should be equal to the total number of vector elements you want to operate on, as each thread computes one element of the vector in your case.
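
To make the point concrete: the posted launch uses N/len blocks of len threads, which is one block of 256 threads only when len equals N = 256, and zero blocks for any longer vector because of integer division. A common pattern is to fix the block size and round the block count up. The sketch below is plain host-side C so it can be checked without a GPU; it emulates the index arithmetic of the kernel. The names grid_size and times_two_emulated are hypothetical, and the bounds check assumes the kernel is given the vector length as an extra argument, which the posted vecAdd does not have.

```c
#include <assert.h>

/* Number of blocks such that blocks * threads_per_block >= n
 * (ceiling division). */
int grid_size(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Host-side emulation of the kernel body for one (block, thread) pair.
 * On the GPU, i would be threadIdx.x + blockDim.x * blockIdx.x. */
static void times_two_thread(const float *a, float *b, int n,
                             int block_idx, int block_dim, int thread_idx)
{
    int i = thread_idx + block_dim * block_idx;
    if (i < n) {                /* guard: the last block may overshoot n */
        b[i] = a[i] * 2.0f;
    }
}

/* Emulates launching grid_size(n, tpb) blocks of tpb threads each. */
void times_two_emulated(const float *a, float *b, int n,
                        int threads_per_block)
{
    int blocks = grid_size(n, threads_per_block);
    for (int blk = 0; blk < blocks; ++blk) {
        for (int t = 0; t < threads_per_block; ++t) {
            times_two_thread(a, b, n, blk, threads_per_block, t);
        }
    }
}
```

In the CUDA version, the corresponding launch would look like vecAdd<<<grid_size(len, 256), 256>>>(x_d, y_d, len), with the same if (i < len) guard inside the kernel; the block size of 256 is an assumed value, not something prescribed by the thread.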

Now it works, thank you very much for your kind answer, and I apologize for my belated reply!

Best,

Jason.

Yes, instead of writing your own MEX file (and actually learning something), maybe you should try to use Jacket?