Efficient way of doing this?

Hi Experts,

       I have three arrays in my card's global memory. For example:
       
          A[] = {2,5,3,5....}
       
          B[] = {1,2,5,4....}

          C[] = {0,0,0,0....}

      In my kernel I have to add the values of A and B and assign the result to C.

      int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
      int q = threadIdx.x;
  Way 1:
         
           C[p] = A[p] + B[p];

   or

   Way 2:

            __shared__ int Asub[BLOCK_SIZE];     
            __shared__ int Bsub[BLOCK_SIZE];

            Asub[q] = A[p];
            Bsub[q] = B[p];
           
            __syncthreads();

            C[p] = Asub[q] + Bsub[q];

Which is faster, Way 1 or Way 2? :unsure:

I want to know which is the more time-consuming task: accessing global memory from the kernel, or copying global memory to shared memory?

There won’t be any benefit in using shared memory in this case.
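
To make this concrete, here is a minimal sketch of Way 1 as a complete kernel plus a small host-side check. Each thread reads one element of A and one element of B exactly once, so there is nothing for shared memory to reuse. The names vecAdd and countVecAddMismatches, the bounds check against n, and the block size of 256 are assumptions made for illustration; they are not from the original post.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size, not from the original post

// Way 1: each thread reads A[p] and B[p] once, directly from global memory.
__global__ void vecAdd(const int *A, const int *B, int *C, int n)
{
    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (p < n)  // guard: n need not be a multiple of BLOCK_SIZE
        C[p] = A[p] + B[p];
}

// Hypothetical host helper (assumes n > 0): runs the kernel on n elements and
// returns how many results differ from a CPU reference, or -1 on a CUDA error.
int countVecAddMismatches(int n)
{
    std::vector<int> hA(n), hB(n), hC(n, 0);
    for (int i = 0; i < n; ++i) { hA[i] = i % 7; hB[i] = i % 5; }

    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // round up
        vecAdd<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);
        // This blocking copy also waits for the kernel to finish.
        ok = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hC[i] != hA[i] + hB[i]) ++bad;
    return bad;
}
```

Calling countVecAddMismatches(1000) from main on a machine with a CUDA-capable device exercises a size that is not a multiple of the block size, which is the case the bounds check exists for.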

OK, thanks Mr. avidday.

Then when is shared memory useful?

Is copying values to shared memory more time consuming?

When threads in a block need to read the same value from global memory more than once, or when the read pattern of the threads in a warp/half-warp breaks the coalescing rules for efficient reads from global memory. The latter is only really important on compute capability 1.0/1.1 devices. On newer hardware the coalescing rules are greatly relaxed, and Fermi has a useful L1/L2 global memory cache, which helps even more.
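
As an illustration of the first case (threads in a block reading the same global value more than once), consider a 1D stencil in which each output is the sum of an input element and its RADIUS neighbours on each side. Every input element is needed by up to 2*RADIUS+1 threads, so staging a tile in shared memory lets the block fetch each element from global memory once. This is only a sketch under stated assumptions: the stencil itself, the names stencilShared and countStencilMismatches, the zero padding at the array ends, and the values of BLOCK_SIZE and RADIUS are all hypothetical and not from the original thread.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size
#define RADIUS 3        // assumed stencil radius

// out[p] = sum of in[p-RADIUS .. p+RADIUS], treating out-of-range inputs as 0.
__global__ void stencilShared(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK_SIZE + 2 * RADIUS];

    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // global index
    int q = threadIdx.x + RADIUS;                   // index into the tile

    // Every thread stages its own element.
    tile[q] = (p < n) ? in[p] : 0;

    // The first RADIUS threads also stage the halo on both sides of the tile.
    if (threadIdx.x < RADIUS) {
        int left  = p - RADIUS;
        int right = p + BLOCK_SIZE;
        tile[q - RADIUS]     = (left >= 0) ? in[left]  : 0;
        tile[q + BLOCK_SIZE] = (right < n) ? in[right] : 0;
    }
    __syncthreads();  // the whole tile must be written before anyone reads it

    if (p < n) {
        int sum = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[q + k];  // 2*RADIUS+1 reads, all from shared memory
        out[p] = sum;
    }
}

// Hypothetical host helper (assumes n > 0): returns the number of outputs that
// differ from a CPU reference, or -1 on a CUDA error.
int countStencilMismatches(int n)
{
    std::vector<int> hIn(n), hOut(n, 0);
    for (int i = 0; i < n; ++i) hIn[i] = i % 11;

    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);

    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        stencilShared<<<blocks, BLOCK_SIZE>>>(dIn, dOut, n);
        ok = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);
    if (!ok) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int ref = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            if (i + k >= 0 && i + k < n) ref += hIn[i + k];
        if (hOut[i] != ref) ++bad;
    }
    return bad;
}
```

Without the tile, each thread would issue 2*RADIUS+1 global reads; with it, a block issues roughly one global read per element plus the halo. How much that helps in practice depends on the hardware, since the caches mentioned above can absorb some of the repeated reads on their own.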

No. But it isn’t any faster if each block just uses the values in shared memory once.
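
One way to check this on a particular card is to time both variants with CUDA events. The sketch below is illustrative only: the kernel names addDirect and addViaShared, the helper timeBothWays, the warm-up launch, and the block size of 256 are assumptions, and a single timed launch gives only a rough number.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size

// Way 1: read operands straight from global memory.
__global__ void addDirect(const int *A, const int *B, int *C, int n)
{
    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (p < n) C[p] = A[p] + B[p];
}

// Way 2: stage operands in shared memory first, then add.
__global__ void addViaShared(const int *A, const int *B, int *C, int n)
{
    __shared__ int Asub[BLOCK_SIZE];
    __shared__ int Bsub[BLOCK_SIZE];
    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int q = threadIdx.x;
    if (p < n) { Asub[q] = A[p]; Bsub[q] = B[p]; }
    __syncthreads();  // kept outside the branch so every thread reaches it
    if (p < n) C[p] = Asub[q] + Bsub[q];
}

// Hypothetical helper (assumes n > 0): times one launch of each kernel, in
// milliseconds. Returns false if any CUDA call fails.
bool timeBothWays(int n, float *msDirect, float *msShared)
{
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    cudaEvent_t start = nullptr, stop = nullptr;

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemset(dA, 0, bytes) == cudaSuccess
           && cudaMemset(dB, 0, bytes) == cudaSuccess
           && cudaEventCreate(&start) == cudaSuccess
           && cudaEventCreate(&stop) == cudaSuccess;

    if (ok) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

        addDirect<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);  // warm-up, not timed

        cudaEventRecord(start);
        addDirect<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        ok = cudaEventElapsedTime(msDirect, start, stop) == cudaSuccess;

        cudaEventRecord(start);
        addViaShared<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        ok = ok && cudaEventElapsedTime(msShared, start, stop) == cudaSuccess;

        ok = ok && cudaGetLastError() == cudaSuccess;
    }

    if (start) cudaEventDestroy(start);
    if (stop)  cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}
```

For steadier numbers, one would typically average over many launches and use an array large enough to keep the whole device busy.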

OK. Thank you :rolleyes: