total sum example

Is there any example of how to convert the sum of all elements of a vector from a sequential for loop to a parallel sum? E.g. my input is 50k samples, and I want to find the total sum in parallel code on the GPU.

It seems like I need to do a partial sum with x blocks and n threads, then take the result from each block and run it through the GPU again as 1 block and x threads? That requires 2 or more kernels; is there any way to use only 1 kernel, so it is the sum of all threads and then the sum of all blocks?

I believe you want a reduction. If that is what you have in mind, a good sample code and presentation are provided in the CUDA samples:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction
https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

One possible approach is to do a “block-draining” reduction; this is covered in the threadfence reduction CUDA sample code:

http://docs.nvidia.com/cuda/cuda-samples/index.html#threadfencereduction
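
The idea there is that every block writes its partial sum to global memory, and the last block to finish performs the final sum, so only one kernel launch is needed. A rough sketch of the pattern (not the actual sample code; the names and float data type are just assumptions, and it expects a power-of-two block size plus blockDim.x * sizeof(float) of dynamic shared memory):

__device__ unsigned int blocksDone = 0;

__global__ void totalSum(const float *g_idata, float *g_partial, float *g_out, int n)
{
    extern __shared__ float sdata[];
    __shared__ bool amLast;

    // grid-stride accumulation into a register
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v += g_idata[i];
    sdata[threadIdx.x] = v;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        g_partial[blockIdx.x] = sdata[0];  // one partial sum per block (g_partial has gridDim.x slots)
        __threadfence();                   // make the partial sum visible to the other blocks
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        amLast = (done == gridDim.x - 1);  // true only in the last block to finish
    }
    __syncthreads();

    if (amLast) {
        // the last ("draining") block sums all the per-block partial results
        float t = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            t += g_partial[i];
        sdata[threadIdx.x] = t;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) { *g_out = sdata[0]; blocksDone = 0; }
    }
}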

On Kepler and newer GPUs, you might also consider just using an atomic update at the end of each threadblock, if your reduction datatype is amenable, to create a single global reduction value. You can use this approach on Fermi as well, but Kepler introduced faster global atomics.
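
A minimal sketch of that approach (assuming float data, a power-of-two block size, and blockDim.x * sizeof(float) of dynamic shared memory; atomicAdd on float requires compute capability 2.0 or later, and *g_out must be zeroed before the launch):

__global__ void sumAtomic(const float *g_idata, float *g_out, int n)
{
    extern __shared__ float sdata[];

    // grid-stride accumulation into a register
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        v += g_idata[i];
    sdata[threadIdx.x] = v;
    __syncthreads();

    // shared-memory tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // one atomic update per block produces the single global result
    if (threadIdx.x == 0) atomicAdd(g_out, sdata[0]);
}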

Note that libraries like thrust:

https://github.com/thrust/thrust/wiki/Quick-Start-Guide
https://thrust.github.io/doc/group__reductions.html

and cub:

https://nvlabs.github.io/cub/
https://nvlabs.github.io/cub/structcub_1_1_device_reduce.html

(and others) provide ready-to-use reduction routines. So depending on your needs, it might be easier to use one of those rather than writing your own.
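
For example, with thrust the whole 50k-element sum is a one-liner (a sketch; the data here is just filler values):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main()
{
    thrust::device_vector<float> d_vec(50000, 1.0f);   // 50k samples, all 1.0f for illustration

    // runs the entire reduction on the GPU
    float total = thrust::reduce(d_vec.begin(), d_vec.end(), 0.0f, thrust::plus<float>());

    printf("total = %f\n", total);
    return 0;
}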

Thanks. In the NVIDIA pdf, for the kernel code of the reduction #7 method:

https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf

What is the purpose of this code here, summing the first data from different blocks? Why is it needed, when the code below it already sums across the threads?
while (i < n) {
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}

I call this a grid-striding loop.

It allows a given grid size (i.e. the total number of threads in a grid) to process a larger data-set size.

So I can launch, for example, 100,000 threads, but fully process a data set consisting of 1,000,000 elements (or larger). The grid of threads “strides” across the data set.
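
As a standalone illustration (a hypothetical kernel, not the one from the reduction sample), the pattern looks like this:

__global__ void scale(float *x, float a, int n)
{
    int stride = gridDim.x * blockDim.x;                 // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
         i < n;
         i += stride)                                    // jump ahead by one full grid
        x[i] *= a;
}

// the launch configuration no longer has to depend on n:
// scale<<<160, 256>>>(d_x, 2.0f, n);   // 40,960 threads can still cover any n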

It has some advantages:

  1. The grid size (and grid size calculations) can be decoupled from the data set size.
  2. The total number of threads can be optimized for the machine architecture.

This second point is a “small” optimization. New CUDA programmers should not pay undue attention to it or draw extensive conclusions from it. However, the basic idea is that once enough parallelism (i.e. threads) has been exposed to fully utilize the machine, exposing additional parallelism (creating more threads) generally does not improve performance, and may actually decrease performance slightly (a few percent?).

Therefore, we launch fewer threads, while still maximizing or optimizing for the machine capacity, and give each thread more work to do. For full optimization, you would probably want to tune the grid size for the machine architecture rather than for the problem size, as might otherwise be typical. So on a Kepler, you might want to launch more threads in your grid than on a Fermi, for example. More specifically, you might want to read some of the device properties at runtime and make a decision about grid size based on those properties (number of SMs, max threads per SM, max blocks per SM, etc.)
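
A sketch of what that might look like (the 256 threads-per-block figure and the "fill every SM to its resident-thread limit" heuristic are just assumptions for illustration, not a recommendation):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // query device 0 at runtime

    int threadsPerBlock = 256;
    int blocksPerSM     = prop.maxThreadsPerMultiProcessor / threadsPerBlock;
    int gridSize        = prop.multiProcessorCount * blocksPerSM;

    printf("%d SMs, %d max threads/SM -> grid of %d blocks x %d threads\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
           gridSize, threadsPerBlock);
    return 0;
}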

Note that at an introductory level, we do not make grid size decisions like this. In particular, we do not set the number of threads equal to the total number of cores, or use any such heuristic. This type of optimization can only be fully understood after acquiring a solid understanding of the nature of the latency-hiding process on GPUs.