Can this approach avoid non-coalesced access?

742820157 · February 27, 2020, 1:05pm

//Pseudo code
// code 1
__global__ void func(double data[], double output)
{
 
__shared__ double temp[BLOCK_SIZE * 3];
int threadId = threadIdx.x;

temp[3 * threadId]     = data[r1]; // r1 is a random index (it may differ from thread to thread)
temp[3 * threadId + 1] = data[r2]; // r2 is a random index
temp[3 * threadId + 2] = data[r3]; // r3 is a random index

__syncthreads();

output = temp[3 * threadId] + temp[3 * threadId + 1] + temp[3 * threadId + 2];

}

//Pseudo code
// code 2
__global__ void func(double data[], double output)
{

output = data[r1] + data[r2] + data[r3];

}

Is code 1 more efficient than code 2?
If so, does code 1 avoid non-coalesced access, and why? Is the data written from global memory to shared memory without non-coalesced access? How does that work?

Thank You!

Greg · February 27, 2020, 2:50pm

The code example above results in all threads performing loads/stores to the same address. Address divergence (non-coalesced access) only occurs when threads in a warp access different 128B address ranges. The LSU supports broadcast on read if multiple threads read the same address. If multiple threads in a warp store to the same address in the same instruction, the LSU picks one thread to write its data. In the code example all threads write to the same address, so there is no address divergence.
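To make the access patterns described above concrete, here is a minimal sketch that is not part of the original post. The kernel name `strided_read`, the helper `strided_read_matches_cpu`, and the `stride` parameter are all hypothetical names chosen for illustration. A stride of 0 makes every thread read the same address, a stride of 1 gives consecutive addresses, and a stride of 16 doubles (128 bytes) spreads the threads of a warp over separate 128B ranges.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread reads in[(tid * stride) % n] and stores it to out[tid].
// stride == 0 : every thread reads in[0], the same address for the whole warp.
// stride == 1 : consecutive doubles, so a warp covers 32 * 8 B = 256 contiguous bytes.
// stride == 16: neighbouring threads are 16 * 8 B = 128 B apart, so (for n = 4096,
//               as in the usage note) each thread of a warp hits a different 128B range.
__global__ void strided_read(const double *in, double *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = in[(tid * stride) % n];
    }
}

// Hypothetical host helper: runs the kernel and compares against a CPU loop.
bool strided_read_matches_cpu(int n, int stride)
{
    std::vector<double> h_in(n), h_out(n, -1.0);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5 * i;

    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    int block = 128;
    int grid = (n + block - 1) / block;
    strided_read<<<grid, block>>>(d_in, d_out, n, stride);
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t copy_err = cudaMemcpy(h_out.data(), d_out, n * sizeof(double),
                                      cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[(i * stride) % n]) return false;
    }
    return true;
}
```

Calling `strided_read_matches_cpu(4096, 0)`, `(4096, 1)` and `(4096, 16)` only checks that the results are correct. To see how many memory transactions each stride costs, the three launches would have to be compared in a profiler.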

Neither kernel has a side effect, so the compiler will optimize each one down to an empty kernel.
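As an illustration of the side-effect point, here is a small sketch under my own assumptions. The names `sum_no_effect`, `sum_with_effect`, and `run_sum_with_effect` are hypothetical. The first kernel mirrors the by-value `output` parameter from the question, and the second writes through a pointer so the result is observable from the host.

```cuda
#include <cuda_runtime.h>

// `output` is passed by value, so the assignment only changes a thread-local
// copy. Nothing observable happens, and the compiler is free to remove the
// loads and the addition.
__global__ void sum_no_effect(const double *data, double output)
{
    output = data[0] + data[1] + data[2];
}

// Writing through a pointer to global memory is a side effect the host can
// observe, so the loads and the addition have to be kept.
__global__ void sum_with_effect(const double *data, double *output)
{
    output[blockIdx.x * blockDim.x + threadIdx.x] = data[0] + data[1] + data[2];
}

// Hypothetical host helper: returns the value written by thread 0,
// or -1.0 if a CUDA call failed.
double run_sum_with_effect()
{
    const int threads = 64;
    const double h_data[3] = {1.0, 2.0, 3.0};
    double h_out[threads] = {0.0};

    double *d_data = nullptr, *d_out = nullptr;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    sum_no_effect<<<1, threads>>>(d_data, 0.0);      // result is discarded
    sum_with_effect<<<1, threads>>>(d_data, d_out);  // result lands in d_out
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[0] : -1.0;
}
```

With the input values 1.0, 2.0 and 3.0 used here, `run_sum_with_effect()` should return 6.0. Inspecting the generated SASS (for example with `cuobjdump -sass`) is one way to check whether the first kernel was reduced to an empty body.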

Please describe the problem you are trying to solve and how you plan to map threads to data.

742820157 · February 27, 2020, 3:23pm

Thank you.
I meant that this is pseudocode, and the variables (such as r1) are random numbers; for example, r1 may differ from thread to thread.
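For reference, the clarified pseudocode could be written out roughly as follows. This is only a sketch under assumptions of mine: the per-thread random indices are stored in hypothetical arrays `r1`, `r2`, `r3`, the result goes to a per-thread `output` array, and the kernel and helper names are made up. Both kernels issue the same global loads `data[r1[tid]]`, `data[r2[tid]]`, `data[r3[tid]]`, so staging the values in shared memory does not change which global addresses a warp requests.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <vector>

#define BLOCK_SIZE 128

// Code 2 style: three gathered global loads per thread, summed directly.
__global__ void gather_direct(const double *data, const int *r1, const int *r2,
                              const int *r3, double *output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) output[tid] = data[r1[tid]] + data[r2[tid]] + data[r3[tid]];
}

// Code 1 style: the same three global loads, staged through shared memory.
__global__ void gather_staged(const double *data, const int *r1, const int *r2,
                              const int *r3, double *output, int n)
{
    __shared__ double temp[BLOCK_SIZE * 3];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int t = 3 * threadIdx.x;  // three private slots per thread
    if (tid < n) {
        temp[t]     = data[r1[tid]];
        temp[t + 1] = data[r2[tid]];
        temp[t + 2] = data[r3[tid]];
    }
    __syncthreads();  // kept from the original; each thread only reads its own slots
    if (tid < n) output[tid] = temp[t] + temp[t + 1] + temp[t + 2];
}

// Hypothetical host helper: true if both kernels match a CPU reference.
bool staged_and_direct_agree(int n)
{
    std::vector<double> h_data(n), h_a(n), h_b(n);
    std::vector<int> h_r1(n), h_r2(n), h_r3(n);
    std::srand(1234);
    for (int i = 0; i < n; ++i) {
        h_data[i] = i;  // integer values keep the sums exact
        h_r1[i] = std::rand() % n;
        h_r2[i] = std::rand() % n;
        h_r3[i] = std::rand() % n;
    }

    double *d_data, *d_a, *d_b;
    int *d_r1, *d_r2, *d_r3;
    size_t db = n * sizeof(double), ib = n * sizeof(int);
    cudaMalloc(&d_data, db); cudaMalloc(&d_a, db); cudaMalloc(&d_b, db);
    cudaMalloc(&d_r1, ib);   cudaMalloc(&d_r2, ib); cudaMalloc(&d_r3, ib);
    cudaMemcpy(d_data, h_data.data(), db, cudaMemcpyHostToDevice);
    cudaMemcpy(d_r1, h_r1.data(), ib, cudaMemcpyHostToDevice);
    cudaMemcpy(d_r2, h_r2.data(), ib, cudaMemcpyHostToDevice);
    cudaMemcpy(d_r3, h_r3.data(), ib, cudaMemcpyHostToDevice);

    int grid = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    gather_direct<<<grid, BLOCK_SIZE>>>(d_data, d_r1, d_r2, d_r3, d_a, n);
    gather_staged<<<grid, BLOCK_SIZE>>>(d_data, d_r1, d_r2, d_r3, d_b, n);
    cudaError_t e1 = cudaMemcpy(h_a.data(), d_a, db, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_b.data(), d_b, db, cudaMemcpyDeviceToHost);

    cudaFree(d_data); cudaFree(d_a); cudaFree(d_b);
    cudaFree(d_r1); cudaFree(d_r2); cudaFree(d_r3);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        double expected = h_data[h_r1[i]] + h_data[h_r2[i]] + h_data[h_r3[i]];
        if (h_a[i] != expected || h_b[i] != expected) return false;
    }
    return true;
}
```

Each value in `temp` is written and read by the same thread exactly once, so the staged version performs extra shared-memory traffic without sharing anything between threads. Which version is faster on a given GPU would still need to be measured.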