matlab, mex files and loops

I think I know the answer to this question already, but I’ll ask anyways:

Is it possible to use a CUDA-calling mex file in matlab this way:

for k=1:N



end

and have the variables stored on the GPU in the mex file saved for the next call of the mex file?

As it is now, I am copying a large matrix in and off of the GPU, which I don’t really have to…If I could just keep the large array there, the calculation would take 1/3 the time, I think.

If its not possible, perhaps this could be a feature request? ( I can dream can’t I? :) )

(It occurs to me that perhaps there might be a mail list devoted to matlab-related questions; a suggestion.)

A bit more trial-and-error computing later, and I can report the following pointers (no pun intended…) when using CUDA with matlab mex files:

When using loops in matlab that call a mex file, or loops within the mex file, there are two memory “leaks” to be aware of.

The first is that when calling a matlab function from a mex file like:

mexCallMATLAB(1,&lhs[0],2,rhs,"mrdivide");

matlab does not overwrite the lhs pointer with each call, but just keeps allocating more space. If you loop enough, you’ll fill up all your computer’s memory. So, after calling this from my CUDA-using mex file to do the matrix division, I copy the result to the GPU and then immediately destroy the array:

mxDestroyArray(lhs[0]);

which does not make lhs go away, but clears out the allocated memory. No more host memory leak.

The second is that when allocating space on the GPU in a mex file, e.g.,

cudaMalloc ((void **)&A, N * sizeof(A[0]));

that allocated space is not cleared when the mex call terminates. So if the mex file is repeatedly called from a loop in matlab, the GPU memory will fill up. You have to be sure to clear it with:

cudaFree(A);

at the end of the mex file. When the call to my mex file was made repeatedly from a loop in matlab, and I noticed that eventually I got erroneous results until I cleared the GPU allocations. No more GPU memory leak. (Luckily I had an 8800GT with 1 GB of RAM that ran correctly to completion, while my 8800GT with 0.5 GB of RAM failed…hummmm…)

The latter point suggests that it should indeed be possible to call a mex file, allocated/calculate on the GPU, and then have those results available again already on the GPU when the mex file is called again. But how to code that…I dunno. A matter of preserving the pointers to the GPU allocated space, I presume.

Perhaps these tips may be useful to someone, although they look obvious (sort of) to me now.

(We need a matlab-dedicated forum and a CUDA documentation wiki!)

You can clear that memory by doing a clear mex from within matlab I believe.

About keeping things available I found the following on the web:

Simply define your object above the mexFunction entry point. It will remain persistent as long as your mex file stays in memory (i.e. you don’t clear the function).

And this from the mathworks website:

Persistent Arrays

You can exempt an array, or a piece of memory, from the MATLAB automatic cleanup by calling mexMakeArrayPersistent or mexMakeMemoryPersistent. However, if a binary MEX-file creates such persistent objects, there is a danger that a memory leak could occur if the MEX-file is cleared before the persistent object is properly destroyed. To prevent this from happening, a source MEX-file that creates persistent objects should register a function, using the mexAtExit function, which disposes of the objects. (You can use a mexAtExit function to dispose of other resources as well; for example, you can use mexAtExit to close an open file.)

For example, here is a simple source MEX-file that creates a persistent array and properly disposes of it.

#include “mex.h”

static int initialized = 0;
static mxArray *persistent_array_ptr = NULL;

void cleanup(void) {
mexPrintf(“MEX-file is terminating, destroying array\n”);
mxDestroyArray(persistent_array_ptr);
}

void mexFunction(int nlhs,
mxArray *plhs,
int nrhs,
const mxArray *prhs)
{
if (!initialized) {
mexPrintf(“MEX-file initializing, creating array\n”);

/* Create persistent array and register its cleanup. */
persistent_array_ptr = mxCreateDoubleMatrix(1, 1, mxREAL);
mexMakeArrayPersistent(persistent_array_ptr);
mexAtExit(cleanup);
initialized = 1;

/* Set the data of the array to some interesting value. */
*mxGetPr(persistent_array_ptr) = 1.0;

} else {
mexPrintf(“MEX-file executing; value of first array
element is %g\n”, *mxGetPr(persistent_array_ptr));
}
}

Maybe you can have a persistent array of pointers to GPU memory in your mex file.

If you have succes, can you let us know on the forum? I might need to do a similar thing in the near future.

I don’t think I was clear about what I meant above. If you have a mex file that does:

for (i = 0; i < LargeN; i++)

      {

        <some stuff>

        mexCallMATLAB(1,&lhs[0],2,rhs,"mrdivide");

        <copy result to GPU>

        <more stuff>

      }

the code will leak memory - lhs is not “overwritten” but keeps allocating new space. So you have to do:

for (i = 0; i < LargeN; i++)

      {

        <some stuff>

        mexCallMATLAB(1,&lhs[0],2,rhs,"mrdivide");

        <copy result to GPU>

        mxDestroyArray(lhs[0]);

        <more stuff>

      }

and the memory “leak” will stop. (This property is no doubt a feature, rather than a bug.)

Thanks for the notes on persistent pointers. Preserving data on the GPU for later use by a mex file would be a neat trick - I’ll post a simple, successful code example should I hit upon how to do it.

Victory, maybe… The code below appears to bring back a value from the GPU at the end. At the moment when one goes to exit matlab at the end it crashes, however.

#include "mex.h"

#include "cuda.h"

static int initialized = 0;

static mxArray *persistent_array_ptr = NULL;

static mxArray *pArg = NULL;

static float *Arg = NULL;

void cleanup(void) {

mexPrintf("MEX-file is terminating, destroying array\n");

mxDestroyArray(persistent_array_ptr);

mexPrintf("Done destroying array 1.\n");

cudaFree(Arg);

mexPrintf("Freed CUDA Array.\n");

mxDestroyArray(pArg);

mexPrintf("Done destroying array 2.\n");

}

void mexFunction(int nlhs, mxArray *plhs[],

            int nrhs, const mxArray *prhs[])

{

      int N;

      int dims0[2];

      float *val,*val2;

      mxArray *rhs = NULL;

     if (!initialized) {

           mexPrintf("MEX-file initializing, creating array\n");

          /* Create persistent array and register its cleanup. */

           dims0[0]=1; dims0[1]=1; 

           persistent_array_ptr = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);  

           val = (float*) mxGetData(persistent_array_ptr);

           mexMakeArrayPersistent(persistent_array_ptr);

/* Set the data of the array to some interesting value. */

           val[0] = 37.0;

          N=1;

           cudaMalloc ((void **)&Arg, N * sizeof(Arg[0]));

           cudaMemcpy(Arg, val, N*sizeof(float), cudaMemcpyHostToDevice);

           dims0[0]=1; dims0[1]=1; 

           pArg = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);  

           mxSetData(pArg,Arg);  

           mexMakeArrayPersistent(pArg);  

          rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);  

           val2 =  (float*) mxGetData(rhs);

           Arg = (float *) mxGetData(pArg);   

           cudaMemcpy(val2, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

           mexPrintf("Check of first array element of GPU at initialization; it is %g\n", val2[0]);

          mexAtExit(cleanup);

           initialized = 1;

       } else {

           val = (float*) mxGetData(persistent_array_ptr);

           mexPrintf("MEX-file executing; value of first array element is %g\n", val[0]);

          dims0[0]=1; dims0[1]=1; 

           rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);  

           val2 = (float*) mxGetData(rhs);

           N=1;

           Arg = (float*) mxGetData(pArg); 

           cudaMemcpy(val2, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

           mexPrintf("Pulling value of first array element of GPU; it is %g\n", val2[0]);

}

}

When I get to work, I will try it out.

I also was not clear ;) What I meant was that when you have filled up all GPU memory by using your mex files, you can clean the GPU memory by using clear mex (at least it seems to have that effect for me)

I see what you mean. I did not have the luxury of waiting for the mex file to finish, however - I watched 8 GB of RAM fill up during the calculation I was trying to do… Then, the calculation on the GPU actually failed within the mex file computation because the card ran out of memory. “O.K… let’s find the leaks…” :)

You were trying to mxDestroyArray() a pointer that points to GPU memory, and that was CudaFree()d just before ;)

The following works without a crash:

#include "mex.h"

#include "cuda.h"

static int initialized = 0;

static mxArray *persistent_array_ptr = NULL;

// static mxArray *pArg = NULL;

static float *Arg = NULL;

void cleanup(void) {

    mexPrintf("MEX-file is terminating, destroying array\n");

    mxDestroyArray(persistent_array_ptr);

    mexPrintf("Done destroying array 1.\n");

    cudaFree(Arg);

    mexPrintf("Freed CUDA Array.\n");

}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

    int N;

    int dims0[2];

    float *val,*val2;

    mxArray *rhs = NULL;

    

    if (!initialized) {

        mexPrintf("MEX-file initializing, creating array\n");

        

    /* Create persistent array and register its cleanup. */

        dims0[0]=1; dims0[1]=1;

        persistent_array_ptr = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

        val = (float*) mxGetData(persistent_array_ptr);

        mexMakeArrayPersistent(persistent_array_ptr);

    /* Set the data of the array to some interesting value. */

        val[0] = 37.0;

        

        N=1;

        cudaMalloc ((void **)&Arg, N * sizeof(Arg[0]));

        cudaMemcpy(Arg, val, N*sizeof(float), cudaMemcpyHostToDevice);

        dims0[0]=1; dims0[1]=1;

        

        rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

        val2 =  (float*) mxGetData(rhs);

        cudaMemcpy(val2, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

        mexPrintf("Check of first array element of GPU at initialization; it is %g\n", val2[0]);

        

        mexAtExit(cleanup);

        initialized = 1;

    } else {

        val = (float*) mxGetData(persistent_array_ptr);

        mexPrintf("MEX-file executing; value of first array element is %g\n", val[0]);

        

        dims0[0]=1; dims0[1]=1;

        rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

        val2 = (float*) mxGetData(rhs);

        N=1;

        cudaMemcpy(val2, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

        mexPrintf("Pulling value of first array element of GPU; it is %g\n", val2[0]);

    }

}

How odd…(and shows you how I am stumbling around…) I take it, therefore, that when

  Â  Â  Â cudaMalloc ((void **)&Arg, N * sizeof(Arg[0]));

is executed, CUDA makes a “persistent” array Arg. That is, Arg can obviously be recovered on the subsequent calls to this mex file.

The “mexMakeArrayPersistent(persistent_array_ptr);” is a red herring insofar as CUDA is concerned!

(I thought I had tried such a thing before, but apparently in not quite the right way.)

REDUCED TO ITS ESSENCE:

#include "mex.h"

#include "cuda.h"

static int initialized = 0;

static float *Arg;

void cleanup(void) {

   mexPrintf("MEX-file is terminating, destroying array\n");

   cudaFree(Arg);

   mexPrintf("Freed CUDA Array.\n");

}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

   int N;

   int dims0[2];

   float *val;

   mxArray *rhs = NULL;

   

   if (!initialized) {

   /* Create persistent array and register its cleanup. */

       

       mexPrintf("MEX-file initializing, creating GPU array and setting a value.\n");

       N=1;

       cudaMalloc ((void **)&Arg, N * sizeof(Arg[0]));

       

       dims0[0]=1; dims0[1]=1;

       rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

       val =  (float*) mxGetData(rhs);

  /* Set the data of the array to some interesting value. */

       val[0] = 37.0;

      cudaMemcpy(Arg, val, N*sizeof(float), cudaMemcpyHostToDevice);

       

       mexAtExit(cleanup);

       initialized = 1;

   } else {

       

       dims0[0]=1; dims0[1]=1;

       rhs = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);

       val = (float*) mxGetData(rhs);

       N=1;

       cudaMemcpy(val, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

       mexPrintf("Pulling value of first array element off GPU; it is %g\n", val[0]);

   }

}

(which is pretty slick, if you ask me!)

Yeah, that is indeed pretty short an not a lot of work. In one of the things I copy & pasted in the earlier post, I think I already read that defining the cuda pointer outside of the mexfunction already does the trick. And given that cuda memory appears to be freed when doing a clear mex, I think the cleanup function and mexAtExit(cleanup); is not even needed also!

Hmm, now I have to see if I will go back to my previous project where I could use this… ;)

Hi ,

Has somebody actually tried this code on cuda 2.3 and cuda array , ie declaring a static cudaarray outside the mexfunction and preserving it for consequent mex calls to bind to texture.

It does not work in my case.I get unknown kernel launch failures.Mostly I think its an indication of out of bound access to a memory that is bound to a texture.In case when I use an ordinary static float* there is no kernel launch failure But still kernel reads values outside the memory.

Any suggestion/example would be appreciated regarding a work around.

Regards,

sisutata

Just to note a fix to an annoying problem. The original mex code causes matlab to segmentation fault when the mex file is cleared or matlab exits. Adding “cudaThreadExit():” to the cleanup script solves this problem. The issue with the code seems to be that elements of CUDA linger even after the Arg variable is cleared, so matlab has a memory corruption problem and barfs when the mex file is cleared. A more correct code listing is therefore:

(the value of Arg is incremented by 1 and saved on the GPU with each call of this mex file)

(as indicated, the “cleanup” routine is called when the mex file is cleared or matlab exits)

ADDED LATER: I realized this code had extraneous stuff in it, a product of our initial thrashing around. Hopefully the code below is a nice, clean example now… If the code is copied to “persist.cu” and compiled, the mex file should be run like:

a=single(15);

persist(a)

persist         % after the first call, any arguments are ignored.

persist

(etc)
#include "mex.h"

#include "cuda.h"

static int initialized = 0;

static float *Arg;

void cleanup(void) {

   mexPrintf("MEX-file is terminating, destroying array\n");

   cudaFree(Arg);

   cudaThreadExit();

   mexPrintf("Freed CUDA Array.\n");

}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

{

   int N;

   float *val;

if (!initialized) {

   /* Copy input value to GPU and register its cleanup. */

if ( !mxIsSingle(prhs[0]) ) {

           mexErrMsgTxt("Input value must be single precision.");

      }

mexPrintf("MEX-file initializing, creating GPU array and setting a value.\n");

       N=1;

       cudaMalloc ((void **)&Arg, N * sizeof(Arg[0]));

val =  (float*) mxGetData(prhs[0]);

/* Set the data of the array to some interesting value. */

//       val[0] = 37.0;

cudaMemcpy(Arg, val, N*sizeof(float), cudaMemcpyHostToDevice);

       mexPrintf("Set initial value to GPU; it is %g\n", val[0]);

mexAtExit(cleanup);

       initialized = 1;

} else {

val = (float*) mxMalloc(1);

       N=1;

       cudaMemcpy(val, Arg, N*sizeof(float), cudaMemcpyDeviceToHost);

       mexPrintf("Pulling the scalar off GPU; it is %g\n", val[0]);

       val[0]=val[0]++;

       cudaMemcpy(Arg, val, N*sizeof(float), cudaMemcpyHostToDevice);

       mexPrintf("Set new value to GPU; it is %g\n", val[0]);

mxFree(val);

}

}

Merely to note that the final example above has been cleaned up some more; extraneous code removed…