The code sample is somewhat incomplete, as we don't have a definition of the SharedMemory<> template class.
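For reference, the CUDA SDK reduction samples typically define this helper roughly as follows (a sketch; the exact definition in the sample under discussion may differ):

```cuda
// SDK-style SharedMemory<> helper: wraps the extern __shared__
// declaration so a templated kernel can obtain a typed pointer to
// dynamically allocated shared memory.
template <class T>
struct SharedMemory {
    __device__ inline operator T *() {
        extern __shared__ int __smem[];
        return (T *)__smem;
    }
    __device__ inline operator const T *() const {
        extern __shared__ int __smem[];
        return (T *)__smem;
    }
};

// Specialization for double, declared with the correct element type
// so the pointer has 8-byte alignment.
template <>
struct SharedMemory<double> {
    __device__ inline operator double *() {
        extern __shared__ double __smem_d[];
        return (double *)__smem_d;
    }
    __device__ inline operator const double *() const {
        extern __shared__ double __smem_d[];
        return (double *)__smem_d;
    }
};
```

Inside a kernel this is used as `double *sdata = SharedMemory<double>();`, with the shared-memory size passed as the third launch-configuration parameter.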
This appears to implement a parallel reduction for one thread block, operating entirely in shared memory. It uses the "+" operator, so each thread block produces the sum of blockDim.x consecutive double values read from g_idata and writes the result to g_odata. The last thread block may process fewer than blockDim.x input values because of the (i < n) conditional expression.
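Based on that description, the kernel presumably looks something like the naive interleaved-addressing version from the SDK reduction walkthrough (a reconstruction under those assumptions, not the actual posted code):

```cuda
// Naive per-block sum reduction, interleaved addressing.
// Assumes blockDim.x is a power of two; shared memory of
// blockDim.x * sizeof(double) is passed at launch.
__global__ void reduce(const double *g_idata, double *g_odata,
                       unsigned int n) {
    extern __shared__ double sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against reading past the end of the input (last block);
    // out-of-range slots are padded with the identity element 0.
    sdata[tid] = (i < n) ? g_idata[i] : 0.0;
    __syncthreads();

    // Interleaved addressing: in each step, threads whose id is a
    // multiple of 2*s add in the partial sum s elements away.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
```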
It's somewhat inefficient: there is no loop unrolling, and the last 5 iterations don't use warp shuffles, which would be slightly more efficient. The use of % operators certainly doesn't help either, since the divisor 2*s is only known at runtime, so the code may end up performing costly integer divisions here.
It would also appear that the code's strided (interleaved) addressing pattern will cause shared memory bank conflicts, as is.
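Those issues can be addressed together: switching to sequential addressing removes both the % and the bank conflicts, and the final warp can be reduced with shuffles instead of shared memory. A sketch of such a variant (hypothetical name `reduce_opt`; assumes blockDim.x is a power of two and at least 64, and CUDA 9+ for `__shfl_down_sync`):

```cuda
// Optimized per-block sum reduction: sequential addressing
// (conflict-free, no modulo) plus a shuffle-based final warp.
__global__ void reduce_opt(const double *g_idata, double *g_odata,
                           unsigned int n) {
    extern __shared__ double sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_idata[i] : 0.0;
    __syncthreads();

    // Sequential addressing: active threads access a contiguous
    // range, so there are no bank conflicts and no % operator.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final warp: fold the remaining 64 partial sums down to one
    // value using register-to-register warp shuffles.
    if (tid < 32) {
        double v = sdata[tid] + sdata[tid + 32];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) g_odata[blockIdx.x] = v;
    }
}
```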
The PDF linked above details the steps needed to arrive at a more efficient implementation.