The CUDA Programming Guide says the device can read 32-bit, 64-bit, or 128-bit words from global memory into registers in a single instruction.
So I tried a test with the following code:
typedef struct __align__(16)
__shared__ float5 sdata[BLOCK_DIM];
int index = blockIdx.x * blockDim.x + threadIdx.x;
int tx = threadIdx.x;
sdata[tx] = idata[index];
odata[index] = sdata[tx];
I thought the accesses should be coalesced; however, when I test with the CUDA Visual Profiler, the result shows they are not coherent.
Can anyone tell me why? And how can I make them coalesced, other than with the following method:
__shared__ float s_data[BLOCK_DIM * 4];
s_data[threadIdx.x] = idata[index];
s_data[threadIdx.x + BLOCK_DIM] = idata[index + BLOCK_DIM];
s_data[threadIdx.x + 2 * BLOCK_DIM] = idata[index + 2 * BLOCK_DIM];
s_data[threadIdx.x + 3 * BLOCK_DIM] = idata[index + 3 * BLOCK_DIM];