Coalesced float3 code


I am reading “High Performance Computing” from Supercomputing 2007, but I need some explanation of the coalesced float3 code. The goal is to access global memory contiguously, but the code is:

__global__ void accessFloat3Shared(float *g_in, float *g_out)
{
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float s_data[256*3];

    // each thread stages three floats into shared memory
    s_data[threadIdx.x]       = g_in[index];
    s_data[threadIdx.x + 256] = g_in[index + 256];
    s_data[threadIdx.x + 512] = g_in[index + 512];
    __syncthreads();

    // reinterpret shared memory as float3, one element per thread
    float3 a = ((float3*)s_data)[threadIdx.x];

    // ... (rest of the kernel omitted in the excerpt)
}


Given that index is multiplied by 3 on the first line, the indices of successive threads do not look contiguous to me, so the global memory access seems non-contiguous:

s_data[threadIdx.x] = g_in[index];

I suppose I’m wrong… so could anyone explain where?


EDIT (2): the code is right.

Each thread loads 3 floats, so 3 * blockIdx.x * blockDim.x is the base address of this block, and index is the address of the first element handled by this thread.
The first thread loads g_in[base_addr], g_in[base_addr+256], g_in[base_addr+512].
The second loads g_in[base_addr+1], g_in[base_addr+257], g_in[base_addr+513], and so on.

So the threads in the same warp access consecutive addresses.
Is this right?

Yes, you’ve got it right.