About shared memory's contribution to performance when global memory access is coalesced

Hi everyone:

I have two kernels that are identical except for one operation: a load from global memory followed by a multiplication (marked with a comment in the listing below). I timed the two kernels, and the one that uses shared memory is about 25% faster than the other. The following is the definition of the first kernel:

__global__ void
sum1(float *d_C, float *d_A, float *d_B, int ElementN)
{
    // One block processes one vector pair; ELEMENT_N must equal blockDim.x
    __shared__ float accumResult[ELEMENT_N];

    // Current vectors' bases
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;

    accumResult[tx] = A[tx] * B[tx];   // <-- the operation that differs between the two kernels

    // Parallel tree reduction of the per-thread products
    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }

    // Only one thread needs to write the block's result (thread 0 performed
    // the final addition, so no extra synchronization is required here)
    if (tx == 0)
        d_C[blockIdx.x] = accumResult[0];
}
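For context, the timing is done on the host; a minimal sketch of how it might look with CUDA events is below. VECTOR_N, ELEMENT_N, and the launch configuration are assumptions, not values from the original post, and sum1 as defined above is assumed to be in the same file.

#include <cstdio>
#include <cuda_runtime.h>

#define ELEMENT_N 256    // assumed block size; must match the shared array in sum1
#define VECTOR_N  1024   // assumed number of vector pairs (one block per pair)

int main()
{
    const int dataN = VECTOR_N * ELEMENT_N;
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, dataN * sizeof(float));
    cudaMalloc(&d_B, dataN * sizeof(float));
    cudaMalloc(&d_C, VECTOR_N * sizeof(float));
    // Inputs are left uninitialized; the timing does not depend on the values.

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    sum1<<<VECTOR_N, ELEMENT_N>>>(d_C, d_A, d_B, ELEMENT_N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("sum1: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}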

According to the CUDA docs, if global memory access is coalesced, the 16 independent memory transactions of a half-warp are merged into a single transaction, which makes the memory access highly efficient. But what does shared memory contribute to performance when the global memory access is already coalesced? Under that condition, why is there a performance difference between using shared memory and not using it? This is what puzzles me; can anybody help me figure it out?
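To make the terminology concrete, the two kernels below are illustrative sketches (not part of my code) of what coalesced and uncoalesced access patterns look like: the 16 loads of a half-warp merge into one transaction when consecutive threads read consecutive, aligned addresses, and fall apart into separate transactions when they don't.

// Coalesced: thread i reads element i, so the 16 threads of a half-warp
// touch one contiguous, aligned segment -> one memory transaction.
__global__ void copyCoalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Uncoalesced: consecutive threads read addresses 'stride' elements apart,
// so each load lands in a different segment -> up to 16 separate transactions.
__global__ void copyStrided(float *dst, const float *src, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        dst[i] = src[i * stride];
}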

Thanks in advance!

Seems like both codes are the same.

No, “accumResult[tx] = A[tx] * B[tx];” in sum1() is replaced by “accumResult[tx] = A[tx]; accumResult[tx] *= B[tx];” in sum2(); everything else is indeed the same.
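For anyone following along, sum2 presumably looks like this (a reconstruction from the description above; only the two marked lines differ from sum1):

__global__ void
sum2(float *d_C, float *d_A, float *d_B, int ElementN)
{
    __shared__ float accumResult[ELEMENT_N];

    // Current vectors' bases
    float *A = d_A + ElementN * blockIdx.x;
    float *B = d_B + ElementN * blockIdx.x;
    int tx = threadIdx.x;

    accumResult[tx] = A[tx];    // <-- store the loaded value into shared memory,
    accumResult[tx] *= B[tx];   // <-- then read-modify-write it there

    for (int stride = ElementN / 2; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (tx < stride)
            accumResult[tx] += accumResult[stride + tx];
    }

    if (tx == 0)
        d_C[blockIdx.x] = accumResult[0];
}

Note that where sum1 computes the product in a register and does a single store to shared memory, sum2 issues a shared store, a shared load, and another shared store.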