Matlab and CUDA: A Tutorial (very basic, zero-order introduction)

As a result of my past couple of weeks’ work with CUDA (a lot of :argh:), I’ve written up my notes in a very basic 22-page tutorial built around example codes. Much of the material is already on these fora, but rather scattered around. Perhaps people will find the tutorial useful.

The Nvidia Matlab package, while impressive, seems to me to rather miss the mark as a basic introduction to CUDA on Matlab.

Happy to hear back from people with corrections and suggestions; it’s meant to be an evolving document.

(Tutorial revised 6/26/08 - cleanup, corrections, and modest additions)

(Tutorial revised again 8/19/08 - minor additions)

Changed to external link 9/16/09:…torial_8_08.pdf
(Nvidia attachment seems to have been lost, alas.)

(Tutorial revised again 2/12/10 - minor additions)…torial_2_10.pdf

May God Bless You!

I’ve cleaned up the tutorial document a bit: made editorial corrections, clarified things here and there, and added a new section on array dimensioning conventions. The file for download in the entry that started this thread has been updated. Cheers!

Excellent tutorial, but I have to disagree with your motivation; I think you underestimate the importance of MPI here. As long as you are on a single machine, you’re right (replace MPI with OpenMP or hand-crafted pthreads). But there is no way of using CUDA instead of MPI for distributed-memory clusters.

Thanks for the reply. I may not have written that paragraph quite right, but I think we are in agreement. One needs to use the proper tool for the job - in many ways CUDA and MPI are complementary, yin and yang, etc. That was what I intended to say.

Great tutorial!

I just wanted to point out that we are in the process of building a full CUDA engine for MATLAB programs (named Jacket) that may be of interest to people who read this thread. We just launched a free beta release that you can grab at:

We would love to hear your thoughts regarding Jacket and insights on how it can be improved to make MATLAB GPU Computing as beneficial to the community as possible.


John Melonakos

I’ll add here a few notes that I may include in the next version of the tutorial (call them a “patch” to the tutorial for now, if you like, or a TO DO list):

-> On the other mail list, I inquired about using cudaMallocHost() in a mex file to speed up host-device communication by using “pinned memory”. Matlab is fairly memory sensitive, so it is not generally possible to allocate memory this way. I suspect, however, that one could sometimes get away with using cudaMallocHost() in a mex file; it is a matter of avoiding any calls to Matlab. So one could call cudaMallocHost(), compute away, and then clear out the CUDA variables before Matlab becomes aware of what is going on and complains or crashes (it’s the “don’t ask, don’t tell” memory policy).

-> I was puzzled over why my “surf” command wasn’t working. Well, it would run, but nothing would appear in the figure. It turns out that surf does not work on single precision data (at least for me)! If A is single precision, then one needs to do “surf(X,Y,double(A))” to see the data. I think this qualifies as a bug in Matlab.

-> C and CUDA are “row major” in how data are organized in arrays, whereas Matlab and CUBLAS are “column major” (to use the standard nomenclature).

-> I think I need to say something about the grids/blocks/threads/warps and kernel efficiency, but I don’t think I altogether understand those quite yet…as integral to CUDA as they are…
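On the row-major versus column-major note above, here is a minimal sketch in plain C of how the two index conventions differ and how one might repack a Matlab-style (column-major) array into C-style (row-major) order before handing it to a kernel. The function names are my own, made up for illustration; they are not part of the mex, CUDA, or CUBLAS APIs.

```c
#include <stddef.h>

/* Linear offset of element (row, col) in a row-major array (plain C layout).
   nrows is unused but kept so both helpers share one signature. */
size_t idx_row_major(size_t row, size_t col, size_t nrows, size_t ncols)
{
    (void)nrows;
    return row * ncols + col;
}

/* Linear offset of element (row, col) in a column-major array
   (the layout used for Matlab array data and by CUBLAS). */
size_t idx_col_major(size_t row, size_t col, size_t nrows, size_t ncols)
{
    (void)ncols;
    return col * nrows + row;
}

/* Copy a column-major nrows-by-ncols array into a row-major buffer.
   src and dst must each hold nrows * ncols floats and must not overlap. */
void col_to_row_major(const float *src, float *dst, size_t nrows, size_t ncols)
{
    for (size_t r = 0; r < nrows; ++r) {
        for (size_t c = 0; c < ncols; ++c) {
            dst[idx_row_major(r, c, nrows, ncols)] =
                src[idx_col_major(r, c, nrows, ncols)];
        }
    }
}
```

For the 2-by-3 matrix [1 2 3; 4 5 6], Matlab stores the data as 1,4,2,5,3,6; after col_to_row_major the buffer reads 1,2,3,4,5,6. In practice it is often cheaper to skip the copy and simply index with the column-major formula inside the kernel.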


When I try to compile the nvmex example, I get this error:

Where can I find this file?


I have exactly the same problem! :argh: …

I don’t know if this is any help, but such a file does not exist on my Linux machine. I also do not have an; rather, I have nvmex as a bash script (rather than this Perl script). is a Perl module, I believe. I suspect you will have to look to Matlab/MathWorks for some sort of Perl package on Windows. (I presume you’ve searched your machine for this file already, so it is not a matter of having the right search path.)

This link any help?:…lution=1-1TNK6Y

It’s fine, thank you. I solved this bug, is a Perl module supplied by Matlab. Apparently my installation was corrupted and rather old. Upgrading it got the file back in matlab/bin. My bad, sorry.

I’ve developed my small HOWTO a little bit according to the notes above. The main addition is the discussion of multiple processors, warps and threads - I am still a little uncertain about those topics, so I’d be happy to hear of any corrections to misperceptions/poor discussion in the document. I aim for clarity above all things…

I’ve added the link to the latest version above in the first entry of this thread, but here it is also:…pe=post&id=9257

Other than to make any corrections that anyone might suggest, I think I am done with this document for the foreseeable future. I hope people find it useful.

Thanks for your work!
It seems to be very useful for a CUDA noob like me!

It’s not sticky yet ? :D

I am bumping this thread to say that I’ve updated this tutorial somewhat:

  • Mention of Fermi and Tesla
  • Mention of CULA & MAGMA
  • Mention of AccelerEyes and GP-You
  • Memory management discussion
  • Profiler no longer a separate download/install
  • Some reorganization and update of CUDA distribution file names

…torial_2_10.pdf (also listed at the top of the thread)

As always, happy to hear of suggestions of things to correct, add, or develop (insofar as I can).

Quick note: your guide mentions Fermi in the second paragraph, and also says that Tesla is a compute-only device with no display. But the Fermi-based Tesla S2050 and S2070 do have display output.

I am not a Matlab/CUDA user so I had two initial FAQs which I didn’t see the answer to.

  1. Are GPU-accelerated routines available for commercial Matlab only, or is there also support for Octave, a popular open source Matlab clone?
  2. Are GPU-accelerated routines available on Windows? Any platforms other than Linux?

I’ll make the first correction. As for Octave or Windows… I can’t really answer these questions properly. I haven’t the time to chase these things down, as I need to focus on my own project. That said, I believe that people are using CUDA/Matlab in the Windows environment. I believe that Octave can employ mex files or their equivalent, but will it work with CUDA?

If anyone would like to post some answers to these questions, I’d be happy to try to work them into the document.

I can attest that Matlab works with CUDA on both Windows and Linux.

I once wrote up a small example of how to use CUDA within Matlab, using CMake so you can keep the same source and build system on both.

As for Octave: I have never tried it with CUDA.