scatter and gather with CUDA?

Hey, I’m new to CUDA programming, and I have a question for the gurus out there…how does one implement a gather operation in CUDA? For example, say I have N threads per block and M blocks per grid. Each thread calculates a single contribution to a variable’s value, and the results of all N threads are summed into the final result, one for each of the M blocks in the grid. Anyone have any advice for an eager newbie? Thanks!

One way, which may or may not be efficient, is:

__global__ void gather(float *results) {
    __shared__ float values[BLOCKSIZE];
    values[threadIdx.x] = calculate(threadIdx.x); // each thread calculates its contribution in parallel
    __syncthreads();                              // wait until every thread has written its value
    if (threadIdx.x == 0) {                       // a single thread sums the whole block
        for (int i = 1; i < BLOCKSIZE; i++) {
            values[0] += values[i];
        }
        results[blockIdx.x] = values[0];          // one result per block
    }
}

Or, probably much better, use a parallel reduction pattern; an example is provided in the CUDA SDK.

Since you are only summing values within a block, the fastest approach will be a parallel reduction in shared memory. (Good use of shared memory makes CUDA really shine as accessing it is hundreds of times faster than global memory.) Mark Harris has a good tutorial on how parallel reduction works and how to do it fast in CUDA:

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf

This is a great read in general, and the parallel reduction stuff starts on slide 39.

That said, when I’m in a hurry and just want to verify things are working, I’ve been known to use the “thread-0 sum loop” Jamie K showed above. However, when the time comes to speed up your CUDA kernel, replacing that loop with a proper parallel reduction should be high on your to-do list. :)

Thank you both for your replies. I think I will take your advice, seibert, and try the quick-and-dirty sum loop for now, then optimize later. Like I said, I’m new to CUDA, but I’ve got access to some sweet hardware and I have some hugely parallel code that I’m excited to get up and running. Thanks!