About shared memory's contribution to performance when global memory access is coalesced

Hi everyone:

I have two kernels that are identical except for the operation that loads from global memory and performs a multiplication (marked in bold). I timed these two kernels, and the one using shared memory is about 1/4 faster than the other. The following is the definition of these two kernels.

__global__ void sum1(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];

    // Current vectors' bases
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;

    <b>accumResult[tx] = A[tx] * B[tx];</b>

    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads(); // ensure the partial sums from the previous step are visible
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }

    d_C[blockIdx.x] = accumResult[0];
}


According to the CUDA docs, if global memory access is coalesced, the 16 independent memory transactions of a half-warp are merged into a single transaction, which makes memory access highly efficient. But what does shared memory contribute to performance if the global memory access is already coalesced? Under that condition, why is there a performance difference between using shared memory and not using it? These are my puzzles; can anybody help me figure them out?

Thanks in advance!

Seems like both codes are the same.

No, “accumResult[tx] = A[tx] * B[tx];” in sum1() is replaced by “accumResult[tx] = A[tx]; accumResult[tx] *= B[tx];” in sum2(); everything else is indeed the same.
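For clarity, here is a sketch of sum2() as described above. It is identical to sum1() except that the bold line is split into a load into shared memory followed by an in-place multiply (assuming the same `ELEMENT_N` macro and launch configuration as sum1()):

```cuda
__global__ void sum2(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];

    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;

    // Difference from sum1(): the product is formed in shared memory,
    // so accumResult[tx] is written, read back, and written again,
    // instead of one register multiply followed by a single write.
    accumResult[tx] = A[tx];
    accumResult[tx] *= B[tx];

    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }

    d_C[blockIdx.x] = accumResult[0];
}
```

A plausible explanation for the timing gap: in sum1() the multiply happens in a register and shared memory is touched once, while in sum2() the compiler must go through shared memory for the intermediate value, costing extra shared-memory traffic per thread.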