total sum example

Is there any example of how to convert the sum of all elements of a vector from a sequential for loop to a parallel sum? E.g. my input is 50k samples, and I want to find the total sum in parallel code on the GPU.

It seems like I need to do a partial sum with x blocks and n threads, then take the result from each block and run it through the GPU again as 1 block and x threads? That requires 2 or more kernels; is there any way to use only 1 kernel, so it is the sum of all threads and then the sum of all blocks?

I believe you want a reduction. If that is what you have in mind, a good sample code and presentation are provided in the CUDA samples:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction
https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

One possible approach is to do a “block-draining” reduction; this is covered in the threadfence reduction CUDA sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#threadfencereduction
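
The idea there is that every block writes its partial sum to global memory, and the last block to finish performs the final sum, so only one kernel launch is needed. A rough sketch of the pattern (not the actual sample code; the names and float data type are just assumptions, and it expects a power-of-two block size plus blockDim.x * sizeof(float) of dynamic shared memory):

__device__ unsigned int blocksDone = 0;

__global__ void totalSum(const float *g_idata, float *g_partial, float *g_out, int n)
{
    extern __shared__ float sdata[];
    __shared__ bool amLast;

    // grid-stride accumulation into a register
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v += g_idata[i];
    sdata[threadIdx.x] = v;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        g_partial[blockIdx.x] = sdata[0];  // one partial sum per block (g_partial has gridDim.x slots)
        __threadfence();                   // make the partial sum visible to the other blocks
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        amLast = (done == gridDim.x - 1);  // true only in the last block to finish
    }
    __syncthreads();

    if (amLast) {
        // the last ("draining") block sums all the per-block partial results
        float t = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            t += g_partial[i];
        sdata[threadIdx.x] = t;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) { *g_out = sdata[0]; blocksDone = 0; }
    }
}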

On Kepler and newer GPUs, you might also consider just using an atomic update at the end of each threadblock, if your reduction datatype is amenable, to create a single global reduction value. You can use this approach on Fermi as well, but Kepler introduced faster global atomics.
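
A minimal sketch of that approach (assuming float data, a power-of-two block size, and blockDim.x * sizeof(float) of dynamic shared memory; atomicAdd on float requires compute capability 2.0 or later, and *g_out must be zeroed before the launch):

__global__ void sumAtomic(const float *g_idata, float *g_out, int n)
{
    extern __shared__ float sdata[];

    // grid-stride accumulation into a register
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v += g_idata[i];
    sdata[threadIdx.x] = v;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // one atomic update per block produces the single global result
    if (threadIdx.x == 0) atomicAdd(g_out, sdata[0]);
}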

Note that libraries like thrust:

https://github.com/thrust/thrust/wiki/Quick-Start-Guide
https://thrust.github.io/doc/group__reductions.html

and cub:

https://nvlabs.github.io/cub/
https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html

(and others) provide ready-to-use reduction routines. So depending on your needs, it might be easier to use one of those rather than writing your own.
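
For example, with thrust the whole 50k-element sum is a one-liner (a sketch; the data here is just filler values):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    thrust::device_vector<float> d_vec(50000, 1.0f);   // 50k samples, all 1.0f for illustration

    // runs the entire reduction on the GPU
    float total = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());

    printf("total = %f\n", total);
    return 0;
}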

Thanks. In the NVIDIA pdf, for the kernel code of the reduction #7 method:

https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

What is the purpose of this code here, summing the first data from different blocks? Why is it needed, when the code below it already sums across the threads?
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}

I call this a grid-striding loop.

It allows a given grid size (i.e. the total number of threads in a grid) to process a larger data-set size.

So I can launch, for example, 100,000 threads, but fully process a data set consisting of 1,000,000 elements (or larger). The grid of threads “strides” across the data set.
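
As a standalone illustration (a hypothetical kernel, not the one from the reduction sample), the pattern looks like this:

__global__ void scale(float *x, float a, int n)
{
    int stride = gridDim.x * blockDim.x;                 // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
         i < n;
         i += stride)                                    // jump ahead by one full grid
        x[i] *= a;
}

// the launch configuration no longer has to depend on n:
// scale<<<160, 256>>>(d_x, 2.0f, n);   // 40,960 threads can still cover any n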

It has some advantages:

  1. The grid size (and grid size calculations) can be decoupled from the data set size.
  2. The total number of threads can be optimized for the machine architecture.

This second point is a “small” optimization. New CUDA programmers should not pay undue attention to it or draw extensive conclusions from it. However, the basic idea is that once enough parallelism (i.e. threads) has been exposed to fully utilize the machine, exposing additional parallelism (creating more threads) generally does not improve performance, and may actually decrease performance slightly (a few percent?).

Therefore, we launch fewer threads, while still maximizing or optimizing for the machine capacity, and give each thread more work to do. For full optimization, you would probably want to tune the grid size for the machine architecture rather than for the problem size, as might otherwise be typical. So on a Kepler, you might want to launch more threads in your grid than on a Fermi, for example. More specifically, you might want to read some of the device properties at runtime and make a decision about grid size based on those properties (number of SMs, max threads per SM, max blocks per SM, etc.)
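
A sketch of what that might look like (the 256 threads-per-block figure and the "fill every SM to its resident-thread limit" heuristic are just assumptions for illustration, not a recommendation):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // query device 0 at runtime

    int threadsPerBlock = 256;
    int blocksPerSM     = prop.maxThreadsPerMultiProcessor / threadsPerBlock;
    int gridSize        = prop.multiProcessorCount * blocksPerSM;

    printf("%d SMs, %d max threads/SM -> grid of %d blocks x %d threads\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           gridSize, threadsPerBlock);
    return 0;
}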

Note that at an introductory level, we do not make grid size decisions like this. In particular, we do not set the number of threads equal to the total number of cores, or use any such heuristic. This type of optimization can only be fully understood after acquiring a solid understanding of the nature of the latency-hiding process on GPUs.