Hi Experts,
I have three arrays of values in my card's global memory. For example:
A[] = {2,5,3,5....}
B[] = {1,2,5,4....}
C[] = {0,0,0,0....}
In my kernel I have to add the corresponding elements of A and B and store the result in C.
int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;  // global element index (1D launch)
int q = threadIdx.x;                            // index within the block
Way 1:
C[p] = A[p] + B[p];
or
Way 2:
__shared__ int Asub[BLOCK_SIZE];
__shared__ int Bsub[BLOCK_SIZE];
Asub[q] = A[p];
Bsub[q] = B[p];
__syncthreads();
C[p] = Asub[q] + Bsub[q];
Which is faster, Way 1 or Way 2?
I want to know which is the more time-consuming task: accessing global memory directly from the kernel, or copying from global memory to shared memory first?
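For anyone who wants to measure this rather than guess, here is a minimal sketch that times both ways with CUDA events. It is only an illustration under stated assumptions: BLOCK_SIZE = 256, the kernel names addDirect and addShared, the helpers timeKernel and compareWays, the test data, and the bounds check on n are all hypothetical additions and not part of the code above. Note that in Way 2 each thread still reads A[p] and B[p] from global memory exactly once, so the shared-memory copy does not remove any global loads; it adds shared-memory traffic and a barrier on top of them.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed value; the post does not state it

// Way 1: read A and B from global memory and write C directly.
__global__ void addDirect(const int *A, const int *B, int *C, int n)
{
    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (p < n)
        C[p] = A[p] + B[p];
}

// Way 2: stage the operands in shared memory before adding.
__global__ void addShared(const int *A, const int *B, int *C, int n)
{
    __shared__ int Asub[BLOCK_SIZE];
    __shared__ int Bsub[BLOCK_SIZE];
    int p = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int q = threadIdx.x;
    if (p < n) {
        Asub[q] = A[p];
        Bsub[q] = B[p];
    }
    __syncthreads();  // kept outside the branch so every thread reaches it
    if (p < n)
        C[p] = Asub[q] + Bsub[q];
}

typedef void (*AddKernel)(const int *, const int *, int *, int);

// Hypothetical helper: time one launch of kernel k, in milliseconds.
static float timeKernel(AddKernel k, const int *dA, const int *dB, int *dC, int n)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    k<<<blocks, BLOCK_SIZE>>>(dA, dB, dC, n);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Hypothetical helper: run both ways on n elements, print the timings,
// and return true only if both produce A[i] + B[i] everywhere.
static bool compareWays(int n)
{
    std::vector<int> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) { hA[i] = i % 7; hB[i] = i % 5; }

    size_t bytes = n * sizeof(int);
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    AddKernel kernels[2] = { addDirect, addShared };
    const char *names[2] = { "Way 1 (direct)", "Way 2 (shared)" };
    bool ok = true;
    for (int w = 0; w < 2; ++w) {
        cudaMemset(dC, 0, bytes);
        float ms = timeKernel(kernels[w], dA, dB, dC, n);
        cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            if (hC[i] != hA[i] + hB[i]) ok = false;
        printf("%s: %.3f ms\n", names[w], ms);
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok && cudaGetLastError() == cudaSuccess;
}
```

Calling compareWays(1 << 20) from main and compiling with nvcc prints one timing per way. The numbers will depend on the card and the array size, so treat a single run as a rough indication only. Shared memory generally pays off when several threads in a block reuse the same elements, which is not the case in this element-wise add.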