A bit more trial-and-error computing later, and I can report the following pointers (no pun intended…) when using CUDA with matlab mex files:
When using loops in matlab that call a mex file, or loops within the mex file, there are two memory “leaks” to be aware of.
The first is that when calling a matlab function from a mex file like:
mexCallMATLAB(1,&lhs[0],2,rhs,"mrdivide");
matlab does not reuse the memory behind the lhs pointer with each call; it allocates a fresh output array every time, and the earlier ones are not freed until the mex function returns. If you loop enough, you'll fill up all your computer's memory. So, after calling this from my CUDA-using mex file to do the matrix division, I copy the result to the GPU and then immediately destroy the array:
mxDestroyArray(lhs[0]);
which does not make the lhs pointer itself go away, but frees the memory it points to. No more host memory leak.
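A minimal sketch of that pattern (illustrative only; the mrdivide call matches the snippet above, but the loop count and the copy-to-GPU step are placeholders):

```c
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxArray *lhs[1];
    mxArray *rhs[2] = { (mxArray *)prhs[0], (mxArray *)prhs[1] };

    for (int i = 0; i < 1000; ++i) {
        /* each call allocates a brand-new result array in lhs[0] */
        mexCallMATLAB(1, &lhs[0], 2, rhs, "mrdivide");

        /* ... copy mxGetPr(lhs[0]) to the GPU with cudaMemcpy ... */

        /* free the host copy; without this, each iteration leaks one array */
        mxDestroyArray(lhs[0]);
    }
}
```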
The second is that when allocating space on the GPU in a mex file, e.g.,
cudaMalloc ((void **)&A, N * sizeof(A[0]));
that allocated space is not cleared when the mex call terminates. So if the mex file is repeatedly called from a loop in matlab, the GPU memory will fill up. You have to be sure to clear it with:
cudaFree(A);
at the end of the mex file. When my mex file was called repeatedly from a loop in matlab, I eventually got erroneous results until I cleared the GPU allocations. No more GPU memory leak. (Luckily I had an 8800GT with 1 GB of RAM that ran correctly to completion, while my 8800GT with 0.5 GB of RAM failed…hummmm…)
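Putting the two calls together in the gateway, a hedged sketch (assumes the CUDA runtime API and a double-precision input; the kernel launch is elided):

```c
#include "mex.h"
#include <cuda_runtime.h>

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    size_t N = mxGetNumberOfElements(prhs[0]);
    double *A = NULL;

    /* device allocation: survives past the end of this mex call */
    cudaMalloc((void **)&A, N * sizeof(A[0]));
    cudaMemcpy(A, mxGetPr(prhs[0]), N * sizeof(A[0]), cudaMemcpyHostToDevice);

    /* ... kernel launches using A ... */

    /* without this, every call from a matlab loop leaks N doubles of GPU memory */
    cudaFree(A);
}
```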
The latter point suggests that it should indeed be possible to call a mex file, allocate/calculate on the GPU, and then have those results still resident on the GPU when the mex file is called again. But how to code that…I dunno. A matter of preserving the pointers to the GPU-allocated space, I presume.
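One possible way to preserve that pointer (a sketch, not something from the original post): since the device memory itself persists between calls, a static variable in the mex file can hold the pointer, with mexAtExit registered so the allocation is freed when the mex file is cleared. mexAtExit is a standard mex API call; the fixed size and float type here are illustrative assumptions.

```c
#include "mex.h"
#include <cuda_runtime.h>

static float *d_state = NULL;   /* persists between mex calls */

static void cleanup(void)
{
    if (d_state) { cudaFree(d_state); d_state = NULL; }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (d_state == NULL) {
        /* first call: allocate once and keep the device pointer */
        cudaMalloc((void **)&d_state, 1024 * sizeof(float));
        mexAtExit(cleanup);   /* free on "clear mex" or matlab exit */
    }

    /* later calls: d_state still points at the earlier GPU results */
    /* ... launch kernels that read/write d_state ... */
}
```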
Perhaps these tips may be useful to someone, although they look obvious (sort of) to me now.
(We need a matlab-dedicated forum and a CUDA documentation wiki!)