Hi folks,
Sorry if this is a very lame question.
I’ve been reading the presentation on optimization and came across this code on coalescing, which I’m not sure I understand correctly.
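__global__ void accessInt3Shared(float *g_in, float *g_out)
{
    // The hard-coded 256 suggests a launch with blockDim.x == 256.
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float s_data[256*3];

    // Each thread copies three floats, 256 elements apart, into shared memory.
    s_data[threadIdx.x]     = g_in[index];
    s_data[threadIdx.x+256] = g_in[index+256];
    s_data[threadIdx.x+512] = g_in[index+512];
    __syncthreads();

    // Each thread updates one float3, i.e. three consecutive floats in s_data.
    float3 a = ((float3*)s_data)[threadIdx.x];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    ((float3*)s_data)[threadIdx.x] = a;
    __syncthreads();

    // Each thread copies three floats, 256 elements apart, back to global memory.
    g_out[index]     = s_data[threadIdx.x];
    g_out[index+256] = s_data[threadIdx.x+256];
    g_out[index+512] = s_data[threadIdx.x+512];
}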
I can’t really understand why reading 256 and 512 elements ahead helps. After the first __syncthreads(), each thread only modifies one float3, writes it back to shared memory, and waits for the other threads to do the same. After that it writes 3 values to global memory. But if s_data[threadIdx.x+256] was written by another thread (I don’t see why we would need to write it back otherwise), shouldn’t that other thread have written the value to g_out itself?
If other threads are not modifying s_data[threadIdx.x+256], why do we need to write it at all? Or is it something like prefetching for the next warps? I’m really confused.
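To make the question concrete, here is a small host-side sketch (plain C++, not CUDA) that mimics only the index arithmetic of the kernel. It assumes blockDim.x is 256, as the hard-coded 256 suggests. The names kBlockDim, slotsCopied, slotsModified, copierOf, modifierOf and printSlot are mine, not from the presentation. If I’ve read the code right, it suggests that the thread copying a given s_data slot to and from global memory is usually not the thread that modifies it through the float3 cast.

```cpp
#include <array>
#include <cassert>
#include <cstdio>

// Assumption: 256 threads per block, matching the hard-coded 256 in the kernel.
constexpr int kBlockDim = 256;
constexpr int kSlots = 3 * kBlockDim;  // number of floats in s_data

// s_data slots that thread `tid` copies from g_in and later back to g_out.
std::array<int, 3> slotsCopied(int tid) {
    return {tid, tid + kBlockDim, tid + 2 * kBlockDim};
}

// s_data slots that thread `tid` touches via ((float3*)s_data)[tid].
std::array<int, 3> slotsModified(int tid) {
    return {3 * tid, 3 * tid + 1, 3 * tid + 2};
}

// Which thread copies a given slot, and which thread modifies it.
int copierOf(int slot) { return slot % kBlockDim; }
int modifierOf(int slot) { return slot / 3; }

// Checks that every slot is copied by exactly one thread and
// modified by exactly one thread within a block.
void checkCoverage() {
    std::array<int, kSlots> copied{};
    std::array<int, kSlots> modified{};
    for (int tid = 0; tid < kBlockDim; ++tid) {
        for (int s : slotsCopied(tid)) ++copied[s];
        for (int s : slotsModified(tid)) ++modified[s];
    }
    for (int s = 0; s < kSlots; ++s) {
        assert(copied[s] == 1 && modified[s] == 1);
    }
}

// Prints the two owners of one slot.
void printSlot(int slot) {
    std::printf("slot %d: copied by thread %d, modified by thread %d\n",
                slot, copierOf(slot), modifierOf(slot));
}
```

Under this mapping, slot 256 would be copied by thread 0 but modified by thread 85 (as the .y component of its float3), so the two __syncthreads() calls would be what keeps the copy and modify phases from overlapping.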
Thanks!