Need help with summing results from different blocks

Hello everyone,

I need some help with a CUDA kernel. I am not sure what the best way is, and I would like to avoid copying the data back to the host and adding it up there, if possible.

Each thread in my kernel produces a floating-point output, and what I need to do is add all of these outputs up. The kernel looks as follows:


__device__ void DoCalc(float * result)
{
    ... // per-thread calculation (omitted)
}

__global__ void mykernel(int * num_blocks)
{
    const int tid = (blockIdx.x * blockDim.x + threadIdx.x) + (blockIdx.y * gridDim.x);

    if (tid < (*num_blocks)) {

        __shared__ float inc_result[512];

        inc_result[threadIdx.x] = 0.0f;

        DoCalc(inc_result);

        __syncthreads();

After this __syncthreads() call, I would basically like to add up the results from all the threads in ALL the blocks. What is the best way to achieve this? Do I have to copy the data back to the host and launch another kernel?

Thanks for any help you can give me.


Look up scan (parallel prefix sum) in the SDK.


Here are a couple of things:

  1. I’d pass num_blocks to the kernel by value as a plain int rather than an int *; that is much simpler.

  2. To sum the calculated values per block (i.e. the sum of ALL threads within the SAME block), use a reduction; see the reduction sample in the SDK.

  3. To sum the results of all the blocks, either write each block’s result to a separate cell of an output array (one element per block) and then sum that array on the CPU, or use atomicAdd so that each block adds its partial sum to a single total in a thread-safe way. You can also launch a second kernel that sums up the per-block results and writes the single value to global memory for the CPU host code to read.
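To make points 2 and 3 concrete, here is a minimal sketch that combines a per-block shared-memory reduction with one atomicAdd per block. The kernel name, block size, and ComputeValue() are illustrative placeholders, not the original poster's code, and atomicAdd on float requires a device of compute capability 2.0 or later:

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two here

__device__ float ComputeValue(int tid)
{
    // Stand-in for the per-thread calculation (DoCalc in the original post).
    return (float)tid;
}

__global__ void sumKernel(float *total, int n)
{
    __shared__ float partial[BLOCK_SIZE];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread writes its value (or 0 if out of range) to shared memory.
    partial[threadIdx.x] = (tid < n) ? ComputeValue(tid) : 0.0f;
    __syncthreads();

    // Tree reduction within the block; the SDK reduction sample shows
    // progressively more optimized variants of this loop.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // One thread per block adds the block's partial sum to the global total.
    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);
}
```

Note that *total must be zeroed (e.g. with cudaMemset) before the launch; afterwards a single cudaMemcpy of one float brings the final sum back to the host.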

Hope that helped.


This is very helpful. Thank you!