Shared memory's contribution to performance when global memory access is coalesced

Gimurk · July 12, 2011, 1:39am

Hi everyone:

I have two kernels that are almost identical. The only difference is how the values loaded from global memory are multiplied (the marked lines in sum2). I timed both kernels, and the timing showed that the kernel using shared memory is 1/4 faster than the other. The definitions of the two kernels follow.

__global__ void

sum1(float *d_C, float *d_A, float *d_B, int ElementN)

{

 __shared__ float accumResult[ELEMENT_N];

 //Current vectors bases

 float *A = d_A + ElementN * blockIdx.x;

 float *B = d_B + ElementN * blockIdx.x;

 int tx = threadIdx.x;

 accumResult[tx] = A[tx] * B[tx];

 for(int stride = ElementN /2; stride > 0; stride >>= 1)

 {

     __syncthreads();

     if(tx < stride)

         accumResult[tx] += accumResult[stride + tx];

 }

 d_C[blockIdx.x] = accumResult[0];

}

__global__ void

sum2(float *d_C, float *d_A, float *d_B, int ElementN)

{

 __shared__ float accumResult[ELEMENT_N];

 //Current vectors bases

 float *A = d_A + ElementN * blockIdx.x;

 float *B = d_B + ElementN * blockIdx.x;

 int tx = threadIdx.x;

 //Differs from sum1: store A to shared memory, then multiply in place by B

 accumResult[tx] = A[tx];

 accumResult[tx] *= B[tx];

 for(int stride = ElementN /2; stride > 0; stride >>= 1)

 {

     __syncthreads();

     if(tx < stride)

         accumResult[tx] += accumResult[stride + tx];

 }

 d_C[blockIdx.x] = accumResult[0];

}
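For anyone who wants to reproduce the measurement, here is a minimal sketch of how the two kernels could be timed with CUDA events and checked against a CPU reference. It is not the exact harness I used. The values of `ELEMENT_N` and `VECTOR_N`, the helper names `SumKernel`, `timeKernel` and `compareKernels`, and the repetition count are assumptions made for illustration. The sketch also deviates from the kernels above in one place: only thread 0 writes the result to `d_C`. Error checking of the CUDA calls is left out to keep it short.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define ELEMENT_N 256   // assumption: power of two, equal to the block size
#define VECTOR_N  4096  // assumption: number of vector pairs, one block each

// sum1 as posted: multiply in registers, then one store to shared memory.
__global__ void sum1(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;
    accumResult[tx] = A[tx] * B[tx];
    for (int stride = ElementN / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }
    if (tx == 0)  // deviation from the post: only thread 0 writes the result
        d_C[blockIdx.x] = accumResult[0];
}

// sum2 as posted: store A to shared memory, then multiply in place by B.
__global__ void sum2(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;
    accumResult[tx] = A[tx];
    accumResult[tx] *= B[tx];
    for (int stride = ElementN / 2; stride > 0; stride >>= 1) {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }
    if (tx == 0)  // deviation from the post: only thread 0 writes the result
        d_C[blockIdx.x] = accumResult[0];
}

typedef void (*SumKernel)(float *, float *, float *, int);

// Hypothetical helper: average time per launch in milliseconds.
static float timeKernel(SumKernel k, float *d_C, float *d_A, float *d_B, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    k<<<VECTOR_N, ELEMENT_N>>>(d_C, d_A, d_B, ELEMENT_N);  // warm-up launch
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        k<<<VECTOR_N, ELEMENT_N>>>(d_C, d_A, d_B, ELEMENT_N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms / reps;
}

// Hypothetical helper: prints both timings, returns the number of mismatches.
int compareKernels(void)
{
    const size_t n = (size_t)VECTOR_N * ELEMENT_N;
    float *h_A = (float *)malloc(n * sizeof(float));
    float *h_B = (float *)malloc(n * sizeof(float));
    float *h_C = (float *)malloc(VECTOR_N * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, n * sizeof(float));
    cudaMalloc(&d_B, n * sizeof(float));
    cudaMalloc(&d_C, VECTOR_N * sizeof(float));
    cudaMemcpy(d_A, h_A, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, n * sizeof(float), cudaMemcpyHostToDevice);
    SumKernel kernels[2] = { sum1, sum2 };
    const char *names[2] = { "sum1", "sum2" };
    int errors = 0;
    for (int k = 0; k < 2; ++k) {
        float ms = timeKernel(kernels[k], d_C, d_A, d_B, 100);
        cudaMemcpy(h_C, d_C, VECTOR_N * sizeof(float), cudaMemcpyDeviceToHost);
        for (int v = 0; v < VECTOR_N; ++v) {
            double ref = 0.0;  // CPU reference dot product for vector pair v
            for (int i = 0; i < ELEMENT_N; ++i)
                ref += (double)h_A[v * ELEMENT_N + i] * h_B[v * ELEMENT_N + i];
            if (fabs(ref - h_C[v]) > 1e-3 * fabs(ref)) ++errors;
        }
        printf("%s: %.4f ms per launch\n", names[k], ms);
    }
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return errors;
}
```

Calling `compareKernels()` from `main` prints one timing line per kernel and returns 0 when both results match the CPU reference. Averaging over many launches after a warm-up is meant to keep launch overhead and first-launch effects out of the comparison.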

According to the CUDA docs, if global memory access is coalesced, the 16 independent memory transactions of a half-warp are merged into 1 memory transaction, which makes memory access highly efficient. But what does shared memory contribute to performance when the global memory access is already coalesced? Under that condition, why is there a performance difference between using shared memory and not using it? These questions puzzle me. Can anybody help me figure them out?
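To make clear what I mean by coalesced access, here is a small sketch that contrasts the access pattern of my kernels (thread `tx` reads element `tx`) with a strided pattern. The kernel names `readCoalesced` and `readStrided`, the helper `compareAccessPatterns`, and the sizes and stride are all made up for illustration, and error checking is omitted. I would expect the strided version to be slower, but the actual numbers depend on the GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Thread i reads element i: neighbouring threads touch neighbouring addresses,
// the same pattern as A[tx] and B[tx] in the kernels from the post.
__global__ void readCoalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Thread i reads element i * stride: neighbouring threads touch addresses that
// are stride floats apart, so their accesses cannot be merged the same way.
__global__ void readStrided(float *out, const float *in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}

// Hypothetical helper: times both kernels, returns the number of mismatches.
int compareAccessPatterns(void)
{
    const int n = 1 << 20, stride = 16, threads = 256;  // assumed sizes
    const size_t total = (size_t)n * stride;
    float *h_in = (float *)malloc(total * sizeof(float));
    float *h_out = (float *)malloc(n * sizeof(float));
    for (size_t j = 0; j < total; ++j) h_in[j] = (float)(j % 1000);

    float *d_in, *d_out;
    cudaMalloc(&d_in, total * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, total * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + threads - 1) / threads;
    readCoalesced<<<blocks, threads>>>(d_out, d_in, n);  // warm-up launch

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
    cudaEventRecord(t0);
    readCoalesced<<<blocks, threads>>>(d_out, d_in, n);
    cudaEventRecord(t1);
    readStrided<<<blocks, threads>>>(d_out, d_in, n, stride);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msCoalesced = 0.0f, msStrided = 0.0f;
    cudaEventElapsedTime(&msCoalesced, t0, t1);
    cudaEventElapsedTime(&msStrided, t1, t2);
    printf("coalesced: %.4f ms, strided: %.4f ms\n", msCoalesced, msStrided);

    // d_out now holds the strided result; check it against the host copy.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[(size_t)i * stride]) ++errors;

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return errors;
}
```

Both sum1 and sum2 use the first pattern for their global loads, which is why I assume the global memory side is the same for the two kernels.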

Thanks in advance!