A few questions CUDA Beginner

I’m working on a research grant to port a VBA program that models the energies between isomers and a stationary-phase molecule.
The idea is to take the program from VBA to C++ and then to CUDA in order to get results faster. I currently have CUDA set up so that it compiles and runs programs correctly.

I have the program fully converted to C++ and am beginning the move to CUDA. However, I have a few questions that I can’t find answers to in the CUDA by Example book or online. I was also hoping for some suggestions on how to convert the program into a parallel solution easily.

The code is designed to run 1 million or more iterations, outputting the calculated energy and the xyz/theta xyz positions. What I want is to run multiple instances of the program all at once, while possibly also speeding up each individual run to some extent.

The program itself is currently under 1000 lines, so for the most part this step should not be very difficult.
Some of these questions may be simple but I want to be sure that I am not confused before beginning.

Question 1. Is there a simple way to have CUDA spawn multiple instances of the program, each running on a separate block so as to utilize the whole card (a C2050), with variables local to each instance, and print out the results at the end? Basically, I want to take the code already written and, instead of running 1 million iterations once, have say 100 programs each running 1 million iterations at the same time.

Question 2. If I have multiple instances of the same code running in parallel, how do I keep variables from being overwritten, for example the counter variables that track the current iteration? Is the only solution to use arrays so that the values are stored in different memory locations?

Any other suggestions would also be great to hear. As I said, I’m very new to parallel programming, but I have no problem picking up new ideas or ways of working through code.

Thanks for reading!

I’m only a beginner at CUDA too, but check out CUDA by Example; that book should answer your questions and provides nice examples of implementation. However, since I only read it recently, I worry that I might introduce errors into its explanations if I were to reiterate them :)

(OK, OK, I’ll try:

  1. When you run a kernel on the GPU from your host code [assume you have a function defined as __global__ void Function(){}], you call it like this: Function<<<a,b>>>(); where a is the number of blocks (thread containers) you’ll run and b is the number of threads per block. Each thread acts as its own program, so you’ll have a*b programs in total. Note that a <= 65,535, and b <= 512 on older (compute capability 1.x) hardware or b <= 1024 on Fermi cards such as the C2050 (hardware restrictions). I asked someone a while ago and they said that the best value for b varies with the application, but 128 <= b <= 256 is a good place to start (but b must ALWAYS be a multiple of 32 or else you will read memory very slowly).

  2. Inside device code you can read threadIdx.x and blockIdx.x (assuming 1D thread and block indices) and use those to have each thread work on its own piece of memory.

hope that helps! (and is true))
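
Points 1 and 2 above can be sketched roughly as follows. This is only a minimal illustration: the kernel name fillIds and the sizes (4 blocks of 128 threads) are made up for the example and are not from the original program. Each thread combines blockIdx.x, blockDim.x and threadIdx.x into a unique global index and writes only to its own array element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread records its own global index.
__global__ void fillIds(int *out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    out[gid] = gid;                                   // each thread writes one slot
}

int main()
{
    const int blocks = 4, threadsPerBlock = 128;      // illustrative values only
    const int n = blocks * threadsPerBlock;

    int h_out[n];
    for (int i = 0; i < n; ++i) h_out[i] = -1;        // sentinel so failures show up

    int *d_out = NULL;
    if (cudaMalloc((void **)&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    // Launch syntax: kernelName<<<numBlocks, threadsPerBlock>>>(arguments)
    fillIds<<<blocks, threadsPerBlock>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != i) { printf("mismatch at %d\n", i); return 1; }

    printf("%d threads each wrote their own element\n", n);
    return 0;
}
```

Saved as a .cu file, this should build with plain nvcc; the program returns a non-zero exit code if any element is wrong.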

So if I run the program, the number of threads would be the number of separate programs that are run at once?
Does it then create new instances of the variables in each thread, to avoid one instance overwriting another’s variables?

That’s mostly what I’m worried about. If I run many threads of the same program and they access global variables, is it possible for the threads to access the same variable and change it under one another, messing up the calculations in the end?

Would a good fix be to make all variables local to that one thread?

You’ll run as many threads at a time as your GPU has processors for (behind-the-scenes scheduling takes care of the queuing for you).

If you create a variable inside __global__ void Function(){}, it is private to each thread running that function, so you can do whatever you want to it with no ill effects. You should do this for anything you want to change. But yes, if you were to use cudaMalloc() to get a pointer to some memory on the GPU, pass that pointer to each thread, and let them all modify the contents of that memory without regard to each other, you could get into trouble. If you’re just dealing with local variables, as far as I know there will be no problem. (I don’t think code executed on the GPU can even see global variables made by the CPU by default, not without designating them as zero-copy memory or something.)
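
A rough sketch of that distinction, with a hypothetical kernel countSteps and made-up sizes. The variable local lives separately in every thread, while *total is a single location obtained with cudaMalloc() that all threads share. Using atomicAdd() for the shared update is one standard way to keep it safe; it is not mentioned elsewhere in this thread, so treat it as an added assumption rather than the only option.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: 'local' is private to each thread, '*total' is one
// location in device memory shared by every thread.
__global__ void countSteps(int *total, int steps)
{
    int local = 0;                   // one copy per thread, no interference
    for (int i = 0; i < steps; ++i)
        ++local;
    // A plain "*total += local" here would be a race between threads.
    atomicAdd(total, local);         // updates from different threads are serialised
}

int main()
{
    const int blocks = 8, threadsPerBlock = 64, steps = 1000;  // illustrative

    int *d_total = NULL;
    if (cudaMalloc((void **)&d_total, sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d_total, 0, sizeof(int));

    countSteps<<<blocks, threadsPerBlock>>>(d_total, steps);

    int h_total = -1;
    cudaError_t err = cudaMemcpy(&h_total, d_total, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    if (err != cudaSuccess) return 1;

    const int expected = blocks * threadsPerBlock * steps;
    printf("total = %d, expected = %d\n", h_total, expected);
    return h_total == expected ? 0 : 1;
}
```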

Awesome! Thanks a lot, this clears up some issues and allows me to continue moving on.

Remember that your threads are all going to be executing the same instructions, so if you do something like

for (i = 0; i < 1000000; i++)

do something with threadIdx.x

the index variable is common to all the threads, which are all doing the same instruction in parallel.

“index variable is common to all threads”

That depends on how i is declared.

If you use “for (int i = 0; i < 1000000; i++)”, then every thread has its own variable “i”, which is the fastest option and gives predictable results.

If you use

__shared__ int i;

  for (i = 0; i < 1000000; i++)

then there is only one “i” per block, and the results will be unpredictable because every thread is trying to update that single variable.
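
As a small sketch of the first case, here is a hypothetical kernel runInstances in which every thread runs its own 1,000,000-iteration loop with a private counter and writes its result to its own slot. The arithmetic (summing the loop index) is only a stand-in for the real energy calculation, and the launch of 4 blocks of 32 threads is an arbitrary choice.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread is an independent "instance" of the loop.
__global__ void runInstances(long long *results, int iterations)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    long long sum = 0;                        // private to this thread
    for (int i = 0; i < iterations; i++)      // 'i' is private to this thread too
        sum += i;                             // placeholder for the real work
    results[tid] = sum + tid;                 // one output slot per thread
}

int main()
{
    const int blocks = 4, threadsPerBlock = 32;       // 128 instances, illustrative
    const int instances = blocks * threadsPerBlock;
    const int iterations = 1000000;

    long long h_results[instances];
    for (int t = 0; t < instances; ++t) h_results[t] = -1;

    long long *d_results = NULL;
    if (cudaMalloc((void **)&d_results, instances * sizeof(long long)) != cudaSuccess)
        return 1;

    runInstances<<<blocks, threadsPerBlock>>>(d_results, iterations);

    cudaError_t err = cudaMemcpy(h_results, d_results,
                                 instances * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_results);
    if (err != cudaSuccess) return 1;

    // Every instance must have completed the full loop on its own.
    const long long base = (long long)iterations * (iterations - 1) / 2;
    for (int t = 0; t < instances; ++t)
        if (h_results[t] != base + t) { printf("instance %d is wrong\n", t); return 1; }

    printf("%d instances finished independently\n", instances);
    return 0;
}
```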

threads per block “must ALWAYS be a multiple of 32 or else you will read memory very slowly”

ShingleServ Jamesqf is referring to making reads/writes from global memory (device RAM) contiguous. This is definitely a very desirable design goal and has a huge impact on performance. The programming guide covers contiguous I/O well.

On Threads per block
One of the gurus said something like “if you can, design your code so you can change the number of threads per block easily so you can find the optimum number for your kernel, but somewhere between 128 and 256 is usually a good place to start”. #define is great for this:
#define ThreadsPerBlock 192
__shared__ float2 flow[ThreadsPerBlock];
int th = b * ThreadsPerBlock + tib;
dim3 dimBlock( ThreadsPerBlock );

The ideal number is a multiple of 32.
Most applications can be designed around a multiple of 32, but you don’t have to use one if something about your problem really dictates a different number.

The worst case is to have just over a multiple of 32: threads are scheduled in warps of 32, so the few extra threads occupy a whole additional, mostly idle warp. (Lucky we don’t have months with 33 days in them :) )
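
One possible way to put the #define idea to work when the problem size is not a multiple of the block size: round the number of blocks up and have the surplus threads do nothing. The kernel doubleAll and the size of 1000 elements are invented for this sketch; only ThreadsPerBlock and dimBlock follow the fragment above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ThreadsPerBlock 192   // change in one place to tune; a multiple of 32

// Hypothetical kernel: doubles each of the n elements.
__global__ void doubleAll(float *data, int n)
{
    int th = blockIdx.x * ThreadsPerBlock + threadIdx.x;
    if (th < n)               // threads past the end of the data do nothing
        data[th] *= 2.0f;
}

int main()
{
    const int n = 1000;       // deliberately not a multiple of 192
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = NULL;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(ThreadsPerBlock);
    dim3 dimGrid((n + ThreadsPerBlock - 1) / ThreadsPerBlock);  // round up
    doubleAll<<<dimGrid, dimBlock>>>(d, n);

    cudaError_t err = cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return 1;

    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * i) { printf("mismatch at %d\n", i); return 1; }

    printf("all %d elements doubled\n", n);
    return 0;
}
```

Changing ThreadsPerBlock to 128 or 256 requires no other edits, which makes it easy to time a few values and pick the best one.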


Wow, great information! Thanks a lot guys, this is really helping me along.

Another question I have now relates to function calls within my kernel.
I have parts of the code that I want to run on the device and other parts on the host (basically output at the end of each iteration, very simple stuff).
I remember the book suggesting that simple calculations and some output be done on the CPU.

However, I have some functions defined as __host__ and others as __device__, and when the device code reaches a host function the compiler throws “error : calling a host function from a __device__/__global__ function is not allowed”. How would I go about calling the host functions from inside the device functions?

I know I have to move the variables around in memory with malloc and memcpy, but calling a host function from within a device function does not seem to be something I can do. How would I get around this?
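
The compiler error reflects a real restriction: device code cannot call back into host functions. A common pattern, sketched here under assumptions, is to have the kernel store its results in device memory, copy them back once the kernel has finished, and only then call the host-side output function. The names stepEnergy, computeEnergies and reportResults, and the arithmetic inside them, are placeholders and not the real model.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device-side helper: callable from kernels, not from the host.
__device__ float stepEnergy(int iteration)
{
    return 0.5f * iteration;          // placeholder arithmetic, not the real model
}

// Kernel: stores results instead of trying to print them via a host function.
__global__ void computeEnergies(float *energies, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        energies[gid] = stepEnergy(gid);
}

// Ordinary host function: does the output once the data is back on the CPU.
void reportResults(const float *energies, int n)
{
    for (int i = 0; i < n; ++i)
        printf("iteration %d: energy %f\n", i, energies[i]);
}

int main()
{
    const int n = 8;                  // illustrative size
    float h_energies[n];
    for (int i = 0; i < n; ++i) h_energies[i] = -1.0f;

    float *d_energies = NULL;
    if (cudaMalloc((void **)&d_energies, n * sizeof(float)) != cudaSuccess) return 1;

    computeEnergies<<<1, 32>>>(d_energies, n);

    // Copy results back; this waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_energies, d_energies, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_energies);
    if (err != cudaSuccess) return 1;

    reportResults(h_energies, n);     // host function called from host code

    for (int i = 0; i < n; ++i)
        if (h_energies[i] != 0.5f * i) return 1;
    return 0;
}
```

The split keeps all output in ordinary C++ code, so the existing host-side printing routines could probably be reused largely unchanged.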