Summing a few elements

Hi,

After a parallel reduction I need to sum the partial results produced by each block (I am using the same reduction code as the one in the SDK).

I don't want to write another kernel for this (there aren't that many elements).

Right now I copy all the elements from global memory to host memory and then sum them with a for loop:

[codebox]…

// sup0 holds the per-block partial sums to add up
cudaMemcpy(beta_n, sup0, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

beta = 0;

// compute the norm after the reduction
for (int i = 0; i < numBlocks; i++) {
    beta += beta_n[i];
}

// beta now holds the sum of the elements

[/codebox]

Do you have any suggestions on how to perform the sum without copying the elements to host memory and without writing another kernel?

Can I perform the sum inside the parallel reduction kernel itself (I would need to share the partial sums between the blocks)?
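
To make that second idea concrete, here is a minimal sketch of what I have in mind. It is not the SDK kernel: the kernel name `reduceAndAccumulate`, the accumulator `d_total`, and the test data are made up for illustration. It assumes the device supports `atomicAdd` on `float` (compute capability 2.0 or higher), that the block size is a power of two, and that the accumulator is zeroed before the launch. Each block reduces its slice in shared memory, then thread 0 adds the block's partial sum to one global value.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the SDK one): per-block shared-memory reduction,
// followed by one atomicAdd per block into a single global accumulator.
// Assumes blockDim.x is a power of two and *total was zeroed before launch.
__global__ void reduceAndAccumulate(const float *in, float *total, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; pad with zero past the end of the input.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 shares this block's partial sum with all other blocks.
    if (tid == 0) {
        atomicAdd(total, sdata[0]);
    }
}

int main()
{
    const unsigned int n = 1u << 16;      // illustrative input size
    const unsigned int threads = 256;     // power of two, as assumed above
    const unsigned int numBlocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f);     // illustrative data: all ones
    float *d_in = nullptr;
    float *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float)); // the accumulator must start at zero

    reduceAndAccumulate<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_total, n);

    // Only a single float comes back, instead of numBlocks partial sums.
    float beta = 0.0f;
    cudaMemcpy(&beta, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("beta = %f\n", beta);

    cudaFree(d_in);
    cudaFree(d_total);
    return 0;
}
```

If beta is only needed by later kernels, the final copy could be skipped and `d_total` passed to them directly. One thing I am unsure about: the order of the atomic additions is not fixed, so the floating-point result may differ slightly from run to run.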

Many thanks!