CUDA confusions: a few clarifications on the programming methodology


With the active support of many members on the Nvidia forum, I’ve finally managed to graduate to the ‘programming forum’. Thanks to all for that :w00twave:

A few questions have been swirling in my head and, with Google proving futile, this is where I turn.

A little background first:

As my first task, I am trying to benchmark a one-dimensional vector addition with 20000 elements per vector (the number just signifies something big) on my system (i7 930 and 4 x GTX 295, i.e. 8 GPUs):

a) using only the GPUs (apart from the host calls), with the work distributed equally over all 8 GPUs

b) using only the processor (not relevant here)

1) Quoted from the CUDA Programming Guide version 1.1, section 3.4:

Would running them in quad SLI relieve me of the duty of assigning a section of code to a particular GPU, with the master GPU taking care of it instead?

2) Quoted from the same guide:

Would someone mind clarifying this?

3) Coming to assigning a specific GPU to a part of the code, the only thing available (to my knowledge) is cudaSetDevice. Now, to split the task, do I need to call cudaSetDevice 8 times? What I mean is, is this how the code should look?


cudaSetDevice(0);
myfunc<<<Nblocks, threadPerBlock>>>(param 1, param 2);

cudaSetDevice(1);
myfunc<<<Nblocks, threadPerBlock>>>(param 3, param 4);

cudaSetDevice(2);
myfunc<<<Nblocks, threadPerBlock>>>(param 6, param 5);

...

cudaSetDevice(7);
myfunc<<<Nblocks, threadPerBlock>>>(param x, param z);

And if this is the case, does it imply that cudaSetDevice for the second GPU onwards is executed only after the kernel associated with the previous GPU has completed, and so on?

I have a couple more questions, but they had better be addressed later if this post is not to be marked as spam owing to its length :)…

Thanking you in anticipation for taking the time to read and respond…

You quote the Programming Guide version 1.1. Does this mean you’re still using CUDA 1.1 as well? In this case you might want to switch to CUDA 4.0 if possible, as you’ll understand in a moment.

  1. What this means is that, if your GPUs are SLI-connected, you can only use one of them AT ALL. Thankfully this limitation was removed in newer CUDA versions and you see a bunch of independent GPUs.
  2. If you launch a kernel, the call returns immediately. Thus any host computation you do afterwards will be performed in parallel with the device computation, until you explicitly synchronize or use functions with implicit synchronization (like cudaMemcpy).
  3. The code you outlined works, but only under CUDA 4.0. Pre-4.0 you could only set the device once and had to start multiple host threads, each controlling one device. As for the synchronization, I’m not completely sure. cudaSetDevice is stated to synchronize with neither the previous nor the next device, and each device has its own default stream, which suggests that their execution should overlap. If I understand correctly, all kernels should run in parallel in this case.
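
To make point 3 concrete, here is a minimal sketch of splitting one vector addition over every visible GPU from a single host thread. It assumes CUDA 4.0 or newer built with a reasonably recent nvcc; the kernel name `vecAdd`, the chunking scheme, and the block size of 256 are illustrative choices, not anything taken from the posts above, and error checking is mostly left out for brevity.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Element-wise c = a + b for one device's slice of the vectors.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 20000;  // total element count, as in the question
    int numDevices = 0;
    if (cudaGetDeviceCount(&numDevices) != cudaSuccess || numDevices == 0) {
        std::printf("no CUDA device found\n");
        return 1;
    }

    std::vector<float> hA(N), hB(N), hC(N, 0.0f);
    for (int i = 0; i < N; ++i) { hA[i] = float(i); hB[i] = 2.0f * i; }

    // Assumed partitioning: equal contiguous chunks, the last one may be shorter.
    const int chunk = (N + numDevices - 1) / numDevices;
    std::vector<float*> dA(numDevices, nullptr), dB(numDevices, nullptr),
                        dC(numDevices, nullptr);

    // Pass 1: select each device in turn, upload its slice, launch its kernel.
    for (int d = 0; d < numDevices; ++d) {
        int offset = d * chunk;
        int count = std::max(0, std::min(chunk, N - offset));
        if (count == 0) continue;
        size_t bytes = count * sizeof(float);
        cudaSetDevice(d);  // later CUDA calls in this thread target device d
        cudaMalloc(&dA[d], bytes);
        cudaMalloc(&dB[d], bytes);
        cudaMalloc(&dC[d], bytes);
        cudaMemcpy(dA[d], hA.data() + offset, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB[d], hB.data() + offset, bytes, cudaMemcpyHostToDevice);
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        // The launch returns immediately, so the loop moves on to the next
        // device while this one is still computing.
        vecAdd<<<blocks, threads>>>(dA[d], dB[d], dC[d], count);
    }

    // Pass 2: fetch the results. The blocking cudaMemcpy waits for the
    // kernel launched earlier on the same device.
    for (int d = 0; d < numDevices; ++d) {
        int offset = d * chunk;
        int count = std::max(0, std::min(chunk, N - offset));
        if (count == 0) continue;
        cudaSetDevice(d);
        cudaMemcpy(hC.data() + offset, dC[d], count * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dA[d]); cudaFree(dB[d]); cudaFree(dC[d]);
    }

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (hC[i] != hA[i] + hB[i]) ++mismatches;
    std::printf("devices used: %d, mismatches: %d\n", numDevices, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Launching in one loop and collecting in a second loop is what gives the devices a chance to overlap; copying the result back right after each launch would serialize them.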

Thank you MarcusM :thanks:

I am using CUDA toolkit 4.0, but started with the first programming guide as I found it easier to understand as a beginner when it came to the initial few chapters. I do plan to move on to the newer ones once I have finished it.

2) About SLI: now that I have toolkit 4.0, it should show up as 8 devices, right? (as far as I understand)

3) The explanation of asynchronous functions was bang on and helped clear up the doubt, until the phrase

…some light, if you may? :)

4) Does the code prototype given earlier need to be synchronised only after the last kernel call (after setting the device to 7), since by then all the other kernels (devices 0 to 6) will definitely be synchronised?

Thanks again for your time and effort :)

Although the Programming Guide for CUDA 4.0 is much more voluminous than the one for CUDA 1.1, and may thus seem more daunting, CUDA has changed enough between versions 1.1 and 4.0 that using the Programming Guide for 1.1 could be highly misleading when working with CUDA 4.0. I definitely would not recommend this approach.

If you are looking for a different (possibly gentler) introduction, you may want to check whether the book “CUDA by Example” suits you better.

  2. Yes.
  3. If you use cudaMemcpy, it will automatically synchronize with the current device (because the memory you are trying to copy might still be in the process of being modified). If you want to avoid this, you have to use cudaMemcpyAsync.
  4. No. You’d only synchronize with device 7 this way. You first start all kernels so that they run in parallel and, once all are running, synchronize with every single device. Fortunately you often won’t need to do this explicitly. See 3).
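
As a rough illustration of the "start everything, then synchronize with every device" pattern, the sketch below launches a trivial kernel on each GPU and then waits on each one in a second loop. The kernel `markDone` and its flag buffer are hypothetical stand-ins for real work; `cudaDeviceSynchronize` is the CUDA 4.0+ call that blocks until the currently selected device is idle.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for real work: writes a marker value to a flag.
__global__ void markDone(int* flag, int value) {
    *flag = value;
}

int main() {
    int numDevices = 0;
    if (cudaGetDeviceCount(&numDevices) != cudaSuccess || numDevices == 0) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::vector<int*> dFlag(numDevices, nullptr);

    // Step 1: start one kernel per device. No waiting happens here.
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&dFlag[d], sizeof(int));
        markDone<<<1, 1>>>(dFlag[d], d + 1);
    }

    // Step 2: synchronize with every device, not just the last one selected.
    bool ok = true;
    for (int d = 0; d < numDevices; ++d) {
        cudaSetDevice(d);
        cudaError_t err = cudaDeviceSynchronize();  // waits for device d only
        int hFlag = 0;
        cudaMemcpy(&hFlag, dFlag[d], sizeof(int), cudaMemcpyDeviceToHost);
        std::printf("device %d: %s, flag = %d\n",
                    d, cudaGetErrorString(err), hFlag);
        ok = ok && err == cudaSuccess && hFlag == d + 1;
        cudaFree(dFlag[d]);
    }
    return ok ? 0 : 1;
}
```

In this sketch the explicit `cudaDeviceSynchronize` is strictly redundant, because the blocking `cudaMemcpy` that follows would wait for the kernel anyway. It is kept to show where an explicit wait would go when no such copy follows.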

It’s as if one can either spend days researching on the net or post a topic here and get precise answers from the members within a day. This forum beats Google :pirate: Thank you MarcusM and nJuffa…

An important part of the learning process was to time the kernel execution, and also the whole program (the idea being a breakdown report of the time taken in executing a vector addition).
Almost all the SDK examples use the cutil library functions, which I presume are not part of the standard toolkit (and just copying the header files and DLL into the repository before including them in the code didn’t work for me :| ). I tried cudaEventCreate and then cudaEventElapsedTime.

Even when I start the counter at the beginning of the host function and stop it at the end, the timer seems to measure only the time taken by the device. Is there a way to time the entire code run, including the host?

(The host function allocates 3 arrays of 50000 elements each, initializes them with data, and then copies them onto the device. After the addition, the result is sent back to the host.)

I understand this may have gone a little off topic, but I didn’t want to spam by creating another topic in the forum :smile: :thanks:
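
One possible way to get both numbers, sketched under the assumption of a C++11-capable nvcc: CUDA events only measure work submitted to the GPU between the two records, so the kernel is bracketed with events while the whole run is bracketed with an ordinary host clock (`std::chrono::steady_clock` here). The kernel name `vecAdd` and the block size are illustrative, and error checking is omitted.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 50000;
    const size_t bytes = N * sizeof(float);

    // Host wall clock: covers allocation, initialization, copies and kernel.
    auto hostStart = std::chrono::steady_clock::now();

    std::vector<float> hA(N, 1.0f), hB(N, 2.0f), hC(N, 0.0f);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // GPU events: cover only what is enqueued between the two records.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    vecAdd<<<(N + 255) / 256, 256>>>(dA, dB, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    // Blocking copy: the result is on the host once this call returns.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    auto hostEnd = std::chrono::steady_clock::now();
    double totalMs =
        std::chrono::duration<double, std::milli>(hostEnd - hostStart).count();

    std::printf("kernel only: %.3f ms\n", kernelMs);
    std::printf("whole run  : %.3f ms\n", totalMs);
    return hC[0] == 3.0f ? 0 : 1;
}
```

Note that the whole-run figure also includes the one-time CUDA context creation triggered by the first runtime call, which can easily dwarf a small vector addition. If that is not wanted in the report, a dummy call such as `cudaFree(0)` before starting the host clock is a common way to move it out of the measurement.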