Hi,
After a parallel reduction I need to sum the partial results produced by each block (I am using the same reduction code as the SDK sample).
I don't want to write another kernel for this, since there aren't many elements.
At the moment I copy all the partial sums from global memory to host memory and then add them up in a for loop:
[codebox]…
// sup0 holds the per-block partial sums (device memory)
cudaMemcpy(beta_n, sup0, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);
beta = 0;
// compute the norm after the reduction
for (int i = 0; i < numBlocks; i++) {
    beta += beta_n[i];
}
// beta now holds the sum of the elements
…
[/codebox]
Do you have any suggestions for computing the sum without copying the elements to host memory and without writing another kernel?
Can I compute the sum inside the parallel reduction kernel itself? (That would require sharing the partial sums between blocks.)
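To make the second question concrete, here is a rough sketch of what I have in mind. It is not the SDK kernel: `reduceAndAccumulate`, `d_in` and `d_total` are names I made up for illustration, and it assumes a device that supports `atomicAdd` on `float` (compute capability 2.0 or later, as far as I know) and a power-of-two block size. Each block reduces its slice in shared memory, then thread 0 adds the block's partial sum to a single accumulator in global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the SDK one): per-block shared-memory reduction,
// followed by one atomicAdd per block into a single global accumulator.
__global__ void reduceAndAccumulate(const float *in, float *total, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread; pad with zeros past the end of the input.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x is a power of two).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 shares this block's partial sum with all other blocks.
    // Assumption: float atomicAdd is available on the target device.
    if (tid == 0) {
        atomicAdd(total, sdata[0]);
    }
}

int main()
{
    const unsigned int n = 1 << 16;
    const unsigned int threads = 256;
    const unsigned int numBlocks = (n + threads - 1) / threads;

    // Test input: all ones.
    float *h_in = new float[n];
    for (unsigned int i = 0; i < n; i++) {
        h_in[i] = 1.0f;
    }

    float *d_in = 0;
    float *d_total = 0;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float)); // accumulator must start at zero

    reduceAndAccumulate<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);

    // Only a single float is copied back, instead of numBlocks partial sums.
    float beta = 0.0f;
    cudaMemcpy(&beta, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("beta = %f\n", beta);

    cudaFree(d_in);
    cudaFree(d_total);
    delete[] h_in;
    return 0;
}
```

With this approach the order of the floating-point additions is not fixed, so I would expect the result to differ slightly from run to run on general data. The copy of `beta` back to the host could also be dropped if the value is only needed by later kernels.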
Many thanks!