scatter and gather with CUDA?

Hey, I’m new to CUDA programming, and I have a question for the gurus out there…how does one implement a gather operation in CUDA? For example, say I have N threads per block and M blocks per grid. Each thread calculates a single contribution to a variable’s value, and the results of all N threads are summed into the final result, one for each of the M blocks in the grid. Anyone have any advice for an eager newbie? Thanks!

One way, which may or may not be efficient, is:

__global__ void gather(float *results) {
    __shared__ float values[BLOCKSIZE];
    values[threadIdx.x] = calculate(threadIdx.x); // each thread calculates its contribution in parallel
    __syncthreads();                              // wait until every thread has written its value
    if (threadIdx.x == 0) {                       // a single thread sums the whole block
        for (int i = 1; i < BLOCKSIZE; i++) {
            values[0] += values[i];
        }
        results[blockIdx.x] = values[0];          // one result per block
    }
}

Or, probably much better, use a parallel reduction pattern; an example is provided in the CUDA SDK.

Since you are only summing values within a block, the fastest approach will be a parallel reduction in shared memory. (Good use of shared memory makes CUDA really shine as accessing it is hundreds of times faster than global memory.) Mark Harris has a good tutorial on how parallel reduction works and how to do it fast in CUDA:

http://www.gpgpu.org/sc2007/SC07_CUDA_5_Op…tion_Harris.pdf

This is a great read in general, and the parallel reduction stuff starts on slide 39.

That said, when I’m in a hurry and just want to verify things are working, I’ve been known to use the “thread-0 sum loop” Jamie K showed above. However, when the time comes to speed up your CUDA kernel, replacing that loop with a proper parallel reduction should be high on your to-do list. :)

Thank you both for your replies. I think I will take your advice, seibert, and try the quick-and-dirty sum loop for now, then optimize later. Like I said, I’m new to CUDA, but I’ve got access to some sweet hardware and I have some hugely parallel code that I’m excited to get up and running. Thanks!