Need help with summing results from different blocks

Hello everyone,

I need some help with a CUDA kernel. I am not sure what the best way is, and I would like to avoid copying the data back to the host and adding it up there, if possible.

Each thread in my kernel produces a floating-point output, and I need to add all of these outputs up. The kernel looks as follows:

[codebox]

__device__ void DoCalc(float *result)
{
    // ... per-thread calculation that writes its output to *result ...
}

__global__ void mykernel(int *num_blocks)
{
    const int tid = (blockIdx.x * blockDim.x + threadIdx.x) + (blockIdx.y * gridDim.x);

    __shared__ float inc_result[512];

    if (tid < (*num_blocks))
    {
        inc_result[threadIdx.x] = 0.0f;
        DoCalc(&inc_result[threadIdx.x]);
    }

    // every thread in the block must reach the barrier, so it sits outside the if
    __syncthreads();
}

[/codebox]

After this __syncthreads() call, I would like to add up the results from all the threads in ALL the blocks. What is the best way to achieve this? Do I have to copy the data back to the host, or launch another kernel?

Thanks for any help you can give me.

Luca

Look up scan (parallel prefix sum) in the SDK.
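
For reference, here is a minimal sketch of a single-block inclusive (Hillis-Steele) scan, which is simpler than the full SDK sample; after it runs, the last output element holds the sum of all the inputs. The kernel name and the 512-element limit are just assumptions for illustration:

[codebox]

__global__ void blockScan(const float *in, float *out, int n)
{
    __shared__ float temp[512];              // assumes n <= blockDim.x <= 512

    const int tid = threadIdx.x;
    temp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    // Hillis-Steele inclusive scan: each pass adds the element 'offset' slots back.
    for (int offset = 1; offset < blockDim.x; offset *= 2)
    {
        float addend = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();
        temp[tid] += addend;
        __syncthreads();
    }

    if (tid < n)
        out[tid] = temp[tid];                // out[n - 1] now holds the total of all n inputs
}

[/codebox]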

Hi,

Here are a few things:

  1. I’d pass num_blocks to the kernel by value as a plain int rather than a pointer - much simpler.

  2. To sum the calculated values per block (i.e. the values from ALL threads within the SAME block), use a reduction. See the reduction sample in the SDK.

  3. To sum the results of all the blocks, either write each block’s result to a different cell in an output array (one cell per block) and then sum that array on the CPU, or use atomicAdd to safely accumulate the values calculated by each block. You can also launch a second kernel to sum up the different blocks’ results and write the single value to memory for the host code to read. A minimal sketch combining points 2 and 3 follows below.
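
For example, here is a minimal sketch of points 2 and 3 combined: a shared-memory tree reduction within each block, followed by an atomicAdd of each block’s partial sum into a single global total. The kernel name, the d_total pointer, and the fixed 512-thread block size are just assumptions for illustration, and DoCalc is replaced by a plain load. Note that atomicAdd on float requires compute capability 2.0 or later:

[codebox]

__global__ void blockSumKernel(const float *input, float *d_total, int n)
{
    __shared__ float partial[512];           // one slot per thread in the block

    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one value (or 0 if it falls outside the data).
    partial[tid] = (gid < n) ? input[gid] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One thread per block adds the block's partial sum to the global total.
    // On pre-2.0 hardware, write partial[0] to an output array indexed by
    // blockIdx.x instead, and finish the sum on the host or in a second kernel.
    if (tid == 0)
        atomicAdd(d_total, partial[0]);
}

[/codebox]

Remember to zero *d_total with cudaMemset before the launch; afterwards a single cudaMemcpy of sizeof(float) bytes brings the total back to the host.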

Hope that helped.

eyal

This is very helpful. Thank you!

Luca