Summing a few elements


After a parallel reduction I need to sum the partial results produced by each block (I am using the same reduction code that ships with the SDK).

I don't want to write a kernel for this (there aren't many elements).

Right now I copy all the elements from global memory to host memory and then sum them with a for loop:


//sup0 contains the elements to sum

	//compute norm after reduction
	//(beta must be zero before the loop starts)
	for (int i=0; i<numBlocks; i++){
		beta += beta_n[i];
	}

//beta contains the sum of the elements


Do you have any suggestions on how to perform the sum without copying the elements to host memory and without writing another kernel?
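To make the first question concrete, here is a minimal sketch of one possible route, assuming Thrust is available in the installed toolkit. The function name sumPartialsOnDevice and the pointer d_partial are hypothetical names, not from the SDK sample. Thrust still launches kernels internally, but none has to be written by hand, and only the scalar result is copied back to the host.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Hypothetical helper: sums n floats that already live in device memory.
// d_partial is assumed to be a raw device pointer (e.g. from cudaMalloc)
// holding the per-block partial sums.
float sumPartialsOnDevice(const float* d_partial, int n)
{
    assert(d_partial != nullptr && n >= 0);
    // Wrap the raw pointer so Thrust knows the data is on the device.
    thrust::device_ptr<const float> first = thrust::device_pointer_cast(d_partial);
    // The reduction runs on the GPU; only the final float returns to the host.
    return thrust::reduce(first, first + n, 0.0f);
}
```

With something like this, the host loop would become a single call such as beta = sumPartialsOnDevice(d_partial, numBlocks), where d_partial stands for whatever device buffer holds the partial sums.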

Can I perform the sum inside the parallel reduction kernel itself (I would need to share the partial sums between the blocks)?
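To illustrate what sharing the partial sums between blocks could look like, here is a rough sketch, not the SDK kernel itself. It assumes a device of compute capability 2.0 or higher (needed for atomicAdd on float), a power-of-two block size, and an accumulator that is zeroed before launch. The names reduceToScalar, reduceOnDevice and d_total are made up for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: each block reduces its slice in shared memory, then
// thread 0 adds the block's partial sum to *total with atomicAdd.
// Assumes blockDim.x is a power of two and *total is zero before launch.
__global__ void reduceToScalar(const float* in, float* total, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread, padding with zeros past the end.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the partial sums across blocks.
    if (tid == 0) atomicAdd(total, sdata[0]);
}

// Hypothetical host wrapper: only a single float is copied back.
float reduceOnDevice(const float* d_in, unsigned int n)
{
    assert(d_in != nullptr && n > 0);
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    float* d_total = nullptr;
    cudaMalloc(&d_total, sizeof(float));
    cudaMemset(d_total, 0, sizeof(float));   // accumulator must start at zero

    reduceToScalar<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);

    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_total);
    return total;
}
```

One caveat with this approach: the order in which blocks reach the atomicAdd is not fixed, so with floating-point data the last bits of the result can differ from run to run.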

Many thanks!