 # total sum example

Is there any example of how to convert a sequential for-loop sum over a vector into a parallel sum? E.g. my input is 50k samples, and I want to find the total sum with parallel code on the GPU.

It seems like I need to do a partial sum with x blocks and n threads, then take the result from each block and run it through the GPU again as 1 block with x threads, which requires 2 or more kernels. Is there any way to use only 1 kernel, so it sums across all threads and then across all blocks?

I believe you want a reduction. If that is what you have in mind, a good sample code and presentation are provided in the CUDA samples:

One possible approach is a “block-draining” reduction; this is covered in the threadfenceReduction CUDA sample code:
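A minimal sketch of the block-draining idea, loosely modeled on the threadfenceReduction sample (all names here are illustrative, and `blocksDone` is assumed to be zero before launch): each block writes its partial sum to global memory, a fenced atomic counter identifies the last block to finish, and that one block drains all the partials.

```cuda
__device__ unsigned int blocksDone = 0;  // assumed zeroed before launch

__global__ void drainReduce(const float *in, float *partials, float *out, unsigned int n)
{
    extern __shared__ float sdata[];
    __shared__ bool isLast;
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Phase 1: each block tree-reduces its slice in shared memory.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) {
        partials[blockIdx.x] = sdata[0];
        __threadfence();  // make the partial visible before signaling completion
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLast = (done == gridDim.x - 1);  // true only in the last block to finish
    }
    __syncthreads();

    // Phase 2: the last ("draining") block sums all per-block partials.
    if (isLast) {
        float sum = 0.0f;
        for (unsigned int b = tid; b < gridDim.x; b += blockDim.x)
            sum += partials[b];
        sdata[tid] = sum;
        __syncthreads();
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) { *out = sdata[0]; blocksDone = 0; }
    }
}
```

The `__threadfence()` is the essential part: it guarantees the block's partial sum is visible to the draining block before the counter increment is.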

On Kepler and newer GPUs, you might also consider just using an atomic update at the end of each threadblock, if your reduction datatype is amenable, to create a single global reduction value. You can use this approach on Fermi as well, but Kepler introduced faster global atomics.
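A sketch of that atomic approach (kernel and variable names are my own): each block does an ordinary shared-memory tree reduction, and then a single `atomicAdd` per block folds the partial sum into one global result, so one kernel launch produces the final value.

```cuda
// Assumes *out is zeroed before launch and blockDim.x is a power of two.
__global__ void atomicReduce(const float *in, float *out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the per-block partial sums.
    if (tid == 0) atomicAdd(out, sdata[0]);
}
```

With only one atomic per block (not per element), contention is low, and on Kepler-class hardware the global atomics are fast enough that this is often competitive with a two-kernel reduction.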

Note that libraries like Thrust and CUB (and others) provide ready-to-use reduction routines. So depending on your needs, it might be easier to use one of those rather than writing your own.
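For example, with Thrust the whole problem reduces to a single call (the 50,000-element input here is just illustrative):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    // 50,000 samples, all 1.0f, so the sum should come out to 50000.
    thrust::device_vector<float> d(50000, 1.0f);

    // thrust::reduce launches the reduction kernels for you.
    float total = thrust::reduce(d.begin(), d.end(), 0.0f);
    printf("sum = %f\n", total);
    return 0;
}
```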

Thanks. In the NVIDIA PDF, for the kernel code of the reduction #7 method:

What is the purpose of this code here, summing the first data element from different blocks? Why, when the code below it is already summing across threads?

```cuda
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i + blockSize];
    i += gridSize;
}
```

I call this a grid-striding loop.

It allows a given grid size (i.e. the total number of threads in a grid) to process a larger data-set size.

So I can launch, for example, 100,000 threads, but fully process a data set consisting of 1,000,000 elements (or larger). The grid of threads “strides” across the data set.
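Putting that together, here is a sketch of how the grid-striding loop fits into the full reduction #7 kernel (this is my reconstruction of the pattern from the presentation, not the exact sample code; the bounds check on `i + blockSize` is added so arbitrary `n` is handled safely):

```cuda
// Each thread grid-strides over the input, accumulating many elements into a
// register, and only then does the block perform its shared-memory tree
// reduction. blockSize is the threads-per-block count, fixed at compile time.
template <unsigned int blockSize>
__global__ void reduce7(const float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize * 2 + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    // Grid-striding loop: with, say, 100,000 total threads and n = 1,000,000,
    // each thread folds in roughly 10 elements before any tree reduction runs.
    float sum = 0.0f;
    while (i < n) {
        sum += g_idata[i];
        if (i + blockSize < n) sum += g_idata[i + blockSize];
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // In-block tree reduction over the per-thread partial sums.
    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

So the `while` loop isn't redundant with the tree reduction below it: the loop handles the data-set size exceeding the grid size, and the tree reduction then combines what the loop accumulated within each block.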