Change source code? How exactly...

How exactly do you change the source code of a program to get it to work in CUDA? Does CUDA automatically map and split the program into multiple threads, or do you have to do it yourself? If you do, then how? Just as an example in pseudocode, will CUDA automatically split 23+104+123+2447+2855+236 into separate threads for 23, 104, 123, 2447, 2855 and 236, then automatically do the sum in the most efficient way after retrieving the results, or do you have to write special code to do that?

CUDA does nothing automagically. Even if it did, computing something like 23+104+123+2447+2855+236 in parallel would be hopelessly inefficient. GPUs are built to run a minimum of ~10 to 20 thousand threads concurrently.

To program in CUDA, you write data-parallel kernels and launch them with a grid configuration. A kernel is executed by many threads at once (tens of thousands to millions), each running the same instructions on a different element of the data.
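To make that concrete, here is a minimal sketch of what "doing it yourself" could look like for the sum in the question. The names `sum_kernel` and `sum_on_gpu` are made up for illustration, error checking is left out, and `atomicAdd` is used because it is short, not because it is fast; for large inputs a shared-memory tree reduction is the more usual approach.

```cuda
#include <cuda_runtime.h>

// Kernel: every thread runs this same function on a different element.
__global__ void sum_kernel(const int* data, int n, int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        atomicAdd(result, data[i]);  // correct, but the additions are serialized
    }
}

// Hypothetical host helper (assumes n > 0): copies the input to the GPU,
// launches one thread per element, and copies the total back.
int sum_on_gpu(const int* host_data, int n)
{
    int* dev_data = nullptr;
    int* dev_result = nullptr;
    cudaMalloc(&dev_data, n * sizeof(int));
    cudaMalloc(&dev_result, sizeof(int));
    cudaMemcpy(dev_data, host_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dev_result, 0, sizeof(int));

    // The grid configuration: enough blocks of 256 threads to cover n elements.
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    sum_kernel<<<blocks, threads_per_block>>>(dev_data, n, dev_result);

    // This copy waits for the kernel to finish before reading the result.
    int result = 0;
    cudaMemcpy(&result, dev_result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_data);
    cudaFree(dev_result);
    return result;
}
```

Called as `sum_on_gpu(values, 6)` on the six numbers from the question, this should return 5788, although for six values the copies to and from the GPU cost far more than the additions themselves.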

Here are two good sources for a more detailed introduction:
The lectures at
The CUDA programming guide at

First of all, I did that (using under 10 possible parallelizations) for simplicity. Think about it. Would you rather read

23+104+123+2447+2855+236+23+104+123+2447+2855+236+23+104+123+2447+…(repeat 10,000 times)?

Also, I’ve been reading about automatic parallelization, and I would like to know where I can find a free program that can transform ordinary source code (preferably C) into parallel source code (I don’t care what it is, as long as CUDA can work with it). I am not looking for a compiler; I can use CUDA for that. Using a separate compiler would mean that I would need to decompile its output right back to source code, which is a lengthy and inaccurate process.

You need to invest a bit more time in understanding what CUDA is. Why not install it and play around with the demos in the SDK?

Isn’t CUDA a C compiler that follows explicit (as opposed to implied) parallel cues in the source code? An auto-modification of the source code to add explicit cues from the implied cues should work with that.

Of course. In certain circumstances, it certainly should be possible to transform an apparent data-parallel for loop into a CUDA kernel automagically. The problem is that you are asking for an existing tool to do so, and we have already said that there is none. This is currently an area of academic research! Wen-mei Hwu (the teacher of the course I sent you the link to) is currently working on it.
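As a rough illustration of the kind of transformation being discussed, here is a data-parallel for loop next to a hand-written kernel that does the same work. This is only a sketch done by hand, not the output of any tool; the function names are hypothetical, SAXPY is just a convenient example of a loop with independent iterations, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Ordinary C loop: no iteration depends on any other iteration.
void saxpy_cpu(int n, float a, const float* x, float* y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA version: the loop is gone, and each thread derives its own i.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the grid may contain more threads than elements
        y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper (assumes n > 0) with the same signature as the CPU loop.
void saxpy_gpu(int n, float a, const float* x, float* y)
{
    float* dev_x = nullptr;
    float* dev_y = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev_x, bytes);
    cudaMalloc(&dev_y, bytes);
    cudaMemcpy(dev_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, bytes, cudaMemcpyHostToDevice);

    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    saxpy_kernel<<<blocks, threads_per_block>>>(n, a, dev_x, dev_y);

    // Copy the updated y back; this waits for the kernel to complete.
    cudaMemcpy(y, dev_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
    cudaFree(dev_y);
}
```

The body of the loop carries over almost unchanged; what has to be added by hand is the index calculation, the bounds check, the memory transfers and the launch configuration, which is roughly the work an automatic tool would need to do.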

I couldn’t pull up the course, sorry. Can you e-mail me the contents? My e-mail is (starcalc AT-SIGN gmail (removethisantispam) DOT-SIGN com). You can attach the saved MHT files (click SAVE when on the page, then save as mht web archive).