Hi,
I am reading “High performance computing” from Supercomputing 2007, and I need some explanation of the coalesced float3 code. The goal is to access global memory contiguously, but the code is:
__global__ void accessInt3Shared(float *g_in, float *g_out)
{
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float s_data[256*3];
    s_data[threadIdx.x] = g_in[index];
    s_data[threadIdx.x+256] = g_in[index+256];
    s_data[threadIdx.x+512] = g_in[index+512];
    __syncthreads();
    float3 a = ((float3*)s_data)[threadIdx.x];
    //[...]
Because the index is multiplied by 3 in the first line, the indices of successive threads do not look contiguous to me, so the access to global memory does not seem contiguous:
s_data[threadIdx.x] = g_in[index];
I suppose I’m wrong… So, could anyone explain where?
Thx
EDIT(second): code is right
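For anyone with the same doubt: only the block offset blockIdx.x * blockDim.x is multiplied by 3, while threadIdx.x is added unscaled, so neighbouring threads read neighbouring floats. Below is a minimal host-side sketch in plain C++ (not a kernel) that replays the index arithmetic at compile time. It assumes blockDim.x is 256, which I inferred from s_data[256*3]; the names kBlockDim, globalIndex, loadsAreContiguous and float3MatchesThread are made up for illustration and are not from the slides.

```cpp
// Host-side emulation in plain C++ (not a kernel) of the index arithmetic in
// accessInt3Shared. kBlockDim = 256 is an assumption inferred from the
// s_data[256*3] declaration; all function names here are hypothetical.
constexpr int kBlockDim = 256;

// Element of g_in read by thread `tid` of block `bid` in load number `pass`
// (pass 0, 1, 2 correspond to index, index+256, index+512 in the kernel).
constexpr int globalIndex(int bid, int tid, int pass) {
    return 3 * bid * kBlockDim + tid + kBlockDim * pass;
}

// Within each of the three loads, does thread tid+1 read the float located
// right after the one read by thread tid?
constexpr bool loadsAreContiguous(int bid) {
    for (int pass = 0; pass < 3; ++pass)
        for (int tid = 0; tid + 1 < kBlockDim; ++tid)
            if (globalIndex(bid, tid + 1, pass) != globalIndex(bid, tid, pass) + 1)
                return false;
    return true;
}

// s_data[j] is written by thread j % 256 during load j / 256. Check that
// component c of the float3 read by thread tid, i.e. s_data[3*tid + c],
// came from g_in[3 * (bid*256 + tid) + c], which is that thread's own float3.
constexpr bool float3MatchesThread(int bid) {
    for (int tid = 0; tid < kBlockDim; ++tid)
        for (int c = 0; c < 3; ++c) {
            int j = 3 * tid + c;
            int src = globalIndex(bid, j % kBlockDim, j / kBlockDim);
            if (src != 3 * (bid * kBlockDim + tid) + c)
                return false;
        }
    return true;
}

// Evaluated by the compiler (needs C++14 or later for loops in constexpr).
static_assert(loadsAreContiguous(0) && loadsAreContiguous(5),
              "each load reads 256 consecutive floats");
static_assert(float3MatchesThread(0) && float3MatchesThread(5),
              "each thread ends up with its own float3");
```

So each block copies its 768 floats with three loads of 256 consecutive floats, and the stride-3 access only happens afterwards, in shared memory, when each thread reads its float3 from s_data.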