Results from computations of several blocks

Hi there,

Threads within the same block can communicate through shared memory, but threads of different blocks cannot communicate directly*.

How do I efficiently retrieve the results of computations from several blocks?

Having just 1 block, the following works fine:

__shared__ int matches;

// have each thread work through some data

if (found)
    matches++;

// make the result available to the host application
d_matches[0] = matches;

Obviously, this doesn’t work for more than one block.

Declaring the variable matches as a __device__ variable would be one solution, but there may be better ways of achieving the same result.

Thanks for any help on this,


*By writing the results of each block in kernel1 to global memory and having the blocks of kernel2 read them from there, even blocks can communicate (not very handy, but needed in some cases).
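The two-kernel trick in the footnote can be sketched roughly as follows. This is only an illustration under assumed names (kernel1, kernel2, d_blockResults, d_total are placeholders, and the per-block "computation" is a dummy); the key point is that the kernel launch boundary acts as a grid-wide synchronization, so kernel2 can safely read what every block of kernel1 wrote.

```cuda
// kernel1: each block writes its single result to global memory
__global__ void kernel1(int *d_blockResults)
{
    __shared__ int result;
    if (threadIdx.x == 0)
        result = blockIdx.x;              // placeholder for a real per-block computation
    __syncthreads();

    if (threadIdx.x == 0)
        d_blockResults[blockIdx.x] = result;  // one write per block
}

// kernel2: launched afterwards, so all of kernel1's writes are visible;
// a single block can now combine the per-block results
__global__ void kernel2(const int *d_blockResults, int numBlocks, int *d_total)
{
    int sum = 0;
    for (int i = threadIdx.x; i < numBlocks; i += blockDim.x)
        sum += d_blockResults[i];
    atomicAdd(d_total, sum);              // requires compute capability 1.1+
}
```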

I’m sure some of the SDK examples will help you. Look for ones on reduction or summing or averaging an entire raster.

Oh, your code above may not work correctly if two threads find a match at the same time.
On these GPUs, 8 (or on newer ones, 16) threads perform exactly the same operation on the same clock cycle, and matches++ is actually shorthand for several operations, e.g.:
1) load the contents of matches into a register
2) do other things while the memory is fetched
3) increment the register
4) write the result back to matches.
So if two threads ‘find’ something at the same time, they will both load matches into a register, increment their register, and write it back.
The 2nd thread will overwrite the result of the 1st thread, not increment it.

What you need is for each thread to have a local variable that it increments; then, when all threads have finished ‘working through some data’, sum the values of these variables together. There are several ways of doing that; one is with atomicAdd.

By having each block add its total to a global variable using atomicAdd, you can do what you described with one kernel. Do look at the SDK examples.
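The per-thread-counter plus atomicAdd scheme might look something like this. It is only a sketch under assumed names (countMatches, d_data, d_matches, and the equality test are illustrative, not from the thread); note that shared-memory atomics need compute capability 1.2+, and global atomics on int need 1.1+.

```cuda
__global__ void countMatches(const int *d_data, int n, int *d_matches)
{
    __shared__ int blockMatches;            // per-block running total
    if (threadIdx.x == 0)
        blockMatches = 0;                   // one thread initializes it
    __syncthreads();

    // each thread counts its own matches in a register: no contention
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        if (d_data[i] == 42)                // placeholder predicate
            local++;

    // combine the per-thread counts safely within the block
    atomicAdd(&blockMatches, local);
    __syncthreads();

    // one thread per block adds the block total to the global result
    if (threadIdx.x == 0)
        atomicAdd(d_matches, blockMatches);
}
```

The host zeroes d_matches before the launch and copies it back afterwards; because every update goes through atomicAdd, no match is lost even when many threads find one on the same cycle.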


Thanks for your comments.
I was aware of the problems the matches variable could cause and planned on using a shared array… but working with atomicAdd is much better. Even though in my case the worst case is rather unlikely. But: better safe than sorry. ;)

Very helpful!