Question about this CUDA code

Hello guys! I'm new to CUDA programming and I need some help with this function. What does it do?

```
double __global__ void BAZINGA(double *g_idata, double *g_odata, unsigned int n)
{
    double *sdata = SharedMemory<double>();

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_idata[i] : 0;

    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        if ((tid % (2*s)) == 0)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```

Thank you very much!

The code sample is somewhat incomplete, as we don’t have a definition of the SharedMemory<> template class.
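For reference, in the CUDA toolkit samples it is defined along these lines (a sketch based on the reduction sample’s helper header; the real version also adds specializations, e.g. for double, to guarantee alignment):

```
// Exposes the kernel's dynamically allocated shared memory as a typed
// pointer, so that templated kernels can write: T *sdata = SharedMemory<T>();
template <class T>
struct SharedMemory
{
    __device__ inline operator T *()
    {
        extern __shared__ int __smem[];
        return (T *)__smem;
    }
};
```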

This appears to implement a parallel reduction for one thread block, operating entirely in shared memory. It uses the “+” operator, so each thread block computes the sum of blockDim.x consecutive double values read from g_idata and writes the result to g_odata. The last thread block may process fewer than blockDim.x input values because of the (i<n) conditional expression.
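To make that concrete, here is a minimal host-side sketch of how such a kernel is typically launched (the array sizes and setup are my own assumptions, not from the original post, and the declaration has to be fixed to `__global__ void` first, as noted further down). Each block writes one partial sum, and the host finishes the job:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const unsigned int n = 1 << 20;   // assumed input size
    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;

    double *g_idata, *g_odata;
    cudaMallocManaged(&g_idata, n * sizeof(double));
    cudaMallocManaged(&g_odata, blocks * sizeof(double));
    for (unsigned int j = 0; j < n; ++j) g_idata[j] = 1.0;

    // Third launch parameter: dynamic shared memory, one double per thread.
    // This is the buffer that SharedMemory<double>() returns inside the kernel.
    BAZINGA<<<blocks, threads, threads * sizeof(double)>>>(g_idata, g_odata, n);
    cudaDeviceSynchronize();

    // Each block produced one partial sum in g_odata; add them up on the host.
    double sum = 0.0;
    for (unsigned int b = 0; b < blocks; ++b) sum += g_odata[b];
    printf("sum = %f (expected %u)\n", sum, n);

    cudaFree(g_idata);
    cudaFree(g_odata);
    return 0;
}
```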

For a tutorial on this operation in CUDA, refer to this PDF:
https://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

It’s somewhat inefficient: no loop unrolling is done, and the last 5 iterations don’t use warp shuffles, which would be slightly more efficient. The use of the % operator certainly does not help either (the code may end up performing costly integer divisions here).

It would also appear that the code causes shared memory bank conflicts, as is.
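Both problems go away with the sequential addressing scheme shown in that PDF (kernel #3 there). A sketch of the replacement inner loop; like the original, it assumes blockDim.x is a power of two:

```
// Sequential addressing: the active threads are contiguous (no divergent
// tid % (2*s) test), and consecutive threads touch consecutive shared
// memory words, so there are no bank conflicts.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
{
    if (tid < s)
    {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
```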

The above PDF details the steps necessary to end up with a more efficient implementation.

If you want to bring warp shuffles into the implementation, here is a tutorial on this: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
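As a taste of it, here is the warp-level building block from that post (a sketch, updated to the `__shfl_down_sync` intrinsic that CUDA 9 and later require; the post itself uses the older `__shfl_down`):

```
// Sums val across the 32 threads of a warp entirely in registers, so the
// last 5 reduction steps need neither shared memory nor __syncthreads().
__inline__ __device__ double warpReduceSum(double val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```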

There is a CUDA code sample on parallel reduction shipped with the CUDA toolkit that applies a lot of the above concepts.

This is illegal:

```
double __global__ void …
```
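A `__global__` function is a kernel launched from the host and must have a void return type, so the fix is simply to drop the double:

```
__global__ void BAZINGA(double *g_idata, double *g_odata, unsigned int n)
```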

Thanks, cbuchner1! I’m going to read the tutorial.