Hi everyone:
I have two kernels that are identical except for one operation: loading from global memory and then performing a multiplication (marked with a comment below). I timed these two kernels and found that the one using shared memory is about 25% faster than the other. Here is the definition of the shared-memory kernel.
__global__ void
sum1(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];

    // Current vectors' bases
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;

    // The operation that differs between the two kernels:
    // load from global memory, multiply, store into shared memory.
    accumResult[tx] = A[tx] * B[tx];

    // Tree reduction in shared memory (assumes ElementN is a power of 2
    // and blockDim.x == ElementN).
    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }

    // Only thread 0 is guaranteed to see the final sum after the last
    // step, so guard the write.
    if (tx == 0)
        d_C[blockIdx.x] = accumResult[0];
}
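For reference, the second kernel (which I did not paste above) does the same reduction but keeps the intermediate results in global memory instead of shared memory. A rough sketch of what I mean is below; d_scratch is a hypothetical per-block global scratch buffer I introduce just for this sketch, so treat the details as an assumption rather than my exact code:

```cuda
// Sketch of the non-shared-memory variant: every reduction step reads
// and writes global memory (d_scratch) instead of shared memory.
__global__ void
sum2(float *d_C, float *d_A, float *d_B, float *d_scratch, int ElementN)
{
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    float *accum = d_scratch + ElementN * blockIdx.x;  // per-block scratch
    int tx = threadIdx.x;

    // Same load-and-multiply, but the result stays in global memory.
    accum[tx] = A[tx] * B[tx];

    // Same tree reduction, now entirely through global memory.
    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (tx < stride)
            accum[tx] += accum[stride + tx];
    }

    if (tx == 0)
        d_C[blockIdx.x] = accum[0];
}
```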
According to the CUDA docs, if a global memory access is coalesced, 16 independent memory transactions are merged into a single transaction, which makes the access highly efficient. But what does shared memory contribute to performance if the global memory accesses are already coalesced? Under that condition, why is there a performance difference between the version that uses shared memory and the one that does not? This is my puzzle; can anybody help me figure it out?
Thanks in advance!