questions about GPU parallel reduction variant

xiaoxi · June 27, 2011, 3:54am

Nvidia parallel reduction example http://developer.dow…c/reduction.pdf

In this example we have p1, p2, p3, …, p8. The function only calculates p1+p2+p3+…+p8; using the warp, the final result is p12345678, p7531, p62, p51, p4, p3, p2, p1.

But what if we need all of the results p1+p2+p3+…+p8, p1+p2+p3+…+p7, p1+p2+p3+…+p6, …, p2+p1, p1? Do you have any suggestions? How can we use the warp to do this?

I think we could change reduce5: remove volatile and add __syncthreads() for (tid < 32). Any other ideas? Can we use kernel 6 from the Nvidia example?

template <class T, unsigned int blockSize>
__global__ void
reduce5(T *g_idata, T *g_odata, unsigned int n)
{
    // SharedMemory<T>() is the SDK helper that returns the dynamic shared array
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;

    T mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockSize < n)
        mySum += g_idata[i+blockSize];

    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }

#ifndef DEVICE_EMULATION
    if (tid < 32)
#endif
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        volatile T* smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; EMUSYNC; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; EMUSYNC; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; EMUSYNC; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; EMUSYNC; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; EMUSYNC; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; EMUSYNC; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

brano · June 27, 2011, 7:20am

Hi,

I posted a link for you in another thread.

Here is another one you could use:

http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

You are looking for a prefix scan algorithm, not a compaction.
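To make the difference from a reduction concrete, here is a minimal sketch of an inclusive prefix scan in the naive (Hillis-Steele) style, for a single block only. It is an illustration under stated assumptions, not the code from the NVIDIA sample or from the GPU Gems chapter: the names inclusiveScanBlock and inclusiveScan are hypothetical, the element type is fixed to int, and the input must fit in one block (at most 1024 elements). Every thread reaches each __syncthreads(), because the barrier must not sit inside a branch such as if (tid < 32) that only part of the block takes.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical single-block inclusive scan (naive Hillis-Steele scheme).
// After the kernel, g_odata[k] = g_idata[0] + ... + g_idata[k].
__global__ void inclusiveScanBlock(const int *g_idata, int *g_odata, unsigned int n)
{
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    sdata[tid] = (tid < n) ? g_idata[tid] : 0;
    __syncthreads();

    // At each step, element tid adds the value 'offset' positions to its left.
    for (unsigned int offset = 1; offset < blockDim.x; offset *= 2) {
        // Read the partner first, so no thread overwrites a value
        // that another thread still needs in this step.
        int addend = (tid >= offset) ? sdata[tid - offset] : 0;
        __syncthreads();   // all reads done (reached by every thread)
        sdata[tid] += addend;
        __syncthreads();   // all writes done before the next step
    }

    if (tid < n) g_odata[tid] = sdata[tid];
}

// Hypothetical host wrapper: scans up to 1024 ints using one block.
// Error checking of the CUDA calls is omitted for brevity.
std::vector<int> inclusiveScan(const std::vector<int> &in)
{
    const unsigned int n = static_cast<unsigned int>(in.size());
    assert(n <= 1024 && "this sketch handles a single block only");
    std::vector<int> out(n);
    if (n == 0) return out;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of n threads, n ints of dynamic shared memory.
    inclusiveScanBlock<<<1, n, n * sizeof(int)>>>(d_in, d_out, n);

    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

For the input 1, 2, …, 8, inclusiveScan should return 1, 3, 6, 10, 15, 21, 28, 36, i.e. p1, p1+p2, …, p1+…+p8. This naive form does more additions than a reduction; the GPU Gems chapter linked above also describes a work-efficient variant and how to extend the scan across multiple blocks.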
