MATLAB + CUDA: parallelizing a MATLAB program using CUDA

OBJECTIVE: We are going to parallelize a MATLAB program using CUDA.
PROBLEM: We don’t have any background in MATLAB or CUDA…
as in zero knowledge, so we’re really just browsing the net for ebooks and reading stuff.

My groupmates and I are already done configuring MATLAB and CUDA to work together, and the setup is working fine. The problem is… I don’t know where to start…

Can you guys give me at least an idea of how to start this thing?

My plan is to find the parts of the MATLAB code that can be parallelized, but I don’t know how to choose which parts to work on… any help?

And can you give us tips on whether we’re on the right track…

See, that’s a big problem! Learn MATLAB first, then learn CUDA; after that, combining the two is probably pretty straightforward.

(We don’t even know what you want to do besides “CUDAfy something related to MATLAB,” which is not much of a question.)

Well, depending on what you’re trying to parallelize, you may want to look at the AccelerEyes Jacket product… that will take care of parallelizing things like matrix multiplication, FFTs, etc. (i.e. functionality already built into MATLAB).

If you’re trying to parallelize a custom algorithm, you will need to create a MEX file with CUDA support. This is basically a C program that can read and write MATLAB data, so it can process the data however it likes (e.g. parallel calculations with CUDA). There is a fair amount of documentation about doing this here on the forums and on the NVIDIA website. I believe you will need to write C code (including some MATLAB-specific headers; the MathWorks site has more info on writing MEX files) and then compile it with nvmex (provided by NVIDIA).
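To give a feel for what such a MEX file looks like, here is a rough sketch. It is not taken from any NVIDIA or MathWorks sample. The names scale_array and scale_mex are made up for illustration, and the plain C loop stands in for the place where a real version would copy the data to the GPU and launch a CUDA kernel. The gateway is wrapped in MATLAB_MEX_FILE, which I am assuming the mex build defines, so the file also compiles as ordinary C.

```c
#include <assert.h>
#include <stddef.h>

/* Core computation: y[i] = a * x[i].
 * Every iteration is independent of the others, which is the property a
 * CUDA kernel would exploit (one thread per i). This plain C loop is only
 * a stand-in for that kernel. The name scale_array is hypothetical. */
void scale_array(const double *x, double *y, size_t n, double a)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i];
    }
}

#ifdef MATLAB_MEX_FILE /* assumed to be defined when building with mex */
#include "mex.h"

/* Gateway that MATLAB calls, e.g. y = scale_mex(x, a) (hypothetical name). */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    (void)nlhs; /* at most one output is produced */

    if (nrhs != 2 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])) {
        mexErrMsgTxt("Usage: y = scale_mex(x, a) with a real double array x");
    }

    /* Read the inputs handed over by MATLAB. */
    const double *x = mxGetPr(prhs[0]);
    double a = mxGetScalar(prhs[1]);
    size_t n = mxGetNumberOfElements(prhs[0]);

    /* Allocate the output with the same number of elements as the input. */
    plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);

    /* A CUDA version would copy x to the device, launch the kernel,
     * and copy the result back here. */
    scale_array(x, mxGetPr(plhs[0]), n, a);
}
#endif
```

The point of splitting it this way is that the gateway only shuffles data between MATLAB and C, so you can swap the loop for a GPU kernel later without touching the MATLAB side.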

EDIT: Also, you’ll need to examine your algorithm to see if it can even benefit from parallelization…there are many algorithms out there that are mostly serial, and can’t be parallelized. In this case, you can either run the serial algorithm on multiple pieces of data (in parallel), or find a better (read: parallel) algorithm that suits your problem, and implement that instead.
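A small sketch of the difference, using a made-up example (the logistic map; nothing here comes from the original poster's program, and both function names are hypothetical). The inner loop is serial because each step needs the previous result, but the outer loop over starting values is independent, so that is the one you would hand to CUDA, one thread per element.

```c
#include <stddef.h>

/* Serial part: step k+1 depends on step k, so this loop cannot be
 * split across threads. */
double logistic_iterate(double x0, double r, int steps)
{
    double x = x0;
    for (int k = 0; k < steps; ++k) {
        x = r * x * (1.0 - x);
    }
    return x;
}

/* Parallel part: each starting value is processed independently.
 * On a GPU this loop would disappear and each thread would handle one i;
 * here it is written as a plain C loop for illustration. */
void logistic_batch(const double *x0, double *out, size_t n, double r, int steps)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = logistic_iterate(x0[i], r, steps);
    }
}
```

So even when the algorithm itself is serial, it is worth checking whether the program runs it many times over independent inputs.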

I wrote a small tutorial for just this sort of situation (I’ve been in that boat already…). See: