note “When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions”.
but I have some questions.
1.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;  // each thread handles three consecutive elements
    b[i] = a[i] + a[i + 1] + a[i + 2];
}
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with a single instruction? (I mean, is this a coalesced access?)
Or does coalescing only apply across the different threads of a warp, and not to the accesses made within a single thread?
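To make the access pattern in question 1 concrete, here is a minimal host-side sketch that counts how many memory segments one warp touches for a single load instruction. It is only a model under stated assumptions, not a description of what the hardware actually does: the warp size of 32, the 32-byte segment size, the segment-aligned array base, and the helper name segments_touched are all assumptions and not part of my kernel.

```cpp
#include <cstddef>
#include <set>

// Assumptions (hypothetical model, not from the kernel above):
// 32 threads per warp, 32-byte memory segments, and an array base
// that is aligned to a segment boundary.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kElemBytes = sizeof(double);
constexpr std::size_t kSegmentBytes = 32;

// Counts the distinct segments touched when thread t of the first warp
// loads a[stride * t + offset] in one load instruction.
// Each double is aligned to its own size, so it never straddles two
// segments and its starting byte address identifies the segment.
std::size_t segments_touched(std::size_t stride, std::size_t offset) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t byte_addr = (stride * t + offset) * kElemBytes;
        segments.insert(byte_addr / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, segments_touched(3, 0) gives 24 for the load of a[i] with i = 3 * tid, while a unit-stride load, segments_touched(1, 0), gives 8. Whether the three loads of one thread can be merged is exactly what I am unsure about; the sketch only looks at one load across the warp.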
2.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    b[i] = a[i] + a[i + 10] + a[i + 12];  // assuming no out-of-bounds index
}
This may be a non-coalesced access,
so I changed the code to:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // BLOCK_SIZE is assumed to be a compile-time constant equal to blockDim.x
    __shared__ double shareM[3 * BLOCK_SIZE];
    // each thread uses its own three slots so threads do not overwrite each other
    int s = 3 * threadIdx.x;
    shareM[s] = a[i];
    shareM[s + 1] = a[i + 10];
    shareM[s + 2] = a[i + 12];
    b[i] = shareM[s] + shareM[s + 1] + shareM[s + 2];
}
I write the data from global memory to shared memory, then read it back. Can this avoid the non-coalesced access and improve performance?
Thank you very much.