coalescing example

Hi folks,

I’m sorry if this is very lame question.

I’ve been reading the presentation on optimization and came across the code on coalescing, which I’m not sure I can understand correctly.

__global__ void accessInt3Shared(float *g_in, float *g_out)

{

int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

__shared__ float s_data[256*3];

s_data[threadIdx.x] = g_in[index];

s_data[threadIdx.x+256] = g_in[index+256];

s_data[threadIdx.x+512] = g_in[index+512];

__syncthreads();

float3 a = ((float3*)s_data)[threadIdx.x];

a.x += 2;

a.y += 2;

a.z += 2;

((float3*)s_data)[threadIdx.x] = a;

__syncthreads();

g_out[index] = s_data[threadIdx.x];

g_out[index+256] = s_data[threadIdx.x+256];

g_out[index+512] = s_data[threadIdx.x+512];

}

I can’t really understand why reading 256 and 512 bytes ahead helps… After the first __syncthreads we only modify 1 float3 and write it to shared mem and wait for other threads to do the same. After that we write 3 values to the global mem. But if s_data[x+256] was written by another thread (I don’t see why we need to write it back otherwise) shouldn’t it this another thread have written the value to g_out itself?

If other threads are not modifying s_data[x+256] why do we need to write it anyways? Or is it like prefetching for the next warps? I’m really confused.

Thanks!

The key is that there are three floats in a float3 per thread. The x, y, and z elements are interleaved. If g_in were cast to a float3 pointer (call it “vin”) then the data would be organized like this:
vin[0].x, vin[0].y, vin[0].z, vin[1].x, vin[1].y, vin[1].z, vin[2].x, etc…

Loading vin[threadIdx.x].x (which would be simplest) would be uncoalesced and give bad performance.

So the entire block of blockDim*3 values is copied to shared memory first (blockDim has to be 256 for this code to work). Thread 0 copies vin[0].x, thread 1 copies vin[0].y, thread 2 copies vin[0].z, thread 3 copies vin[1].x, and so on, so it is coalesced. Since there are three times as many values as threads, each thread has to copy three values.

Thanks Jamie!

What I actually missed was

float s_data

being cast to

(float3*)s_data

So I was thinking s_data was float3 and got confused.