coalescing example

_DK · May 9, 2009, 9:15pm

Hi folks,

I’m sorry if this is very lame question.

I’ve been reading the presentation on optimization and came across the code on coalescing, which I’m not sure I can understand correctly.

__global__ void accessInt3Shared(float *g_in, float *g_out)

{

int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

__shared__ float s_data[256*3];

s_data[threadIdx.x] = g_in[index];

s_data[threadIdx.x+256] = g_in[index+256];

s_data[threadIdx.x+512] = g_in[index+512];

__syncthreads();

float3 a = ((float3*)s_data)[threadIdx.x];

a.x += 2;

a.y += 2;

a.z += 2;

((float3*)s_data)[threadIdx.x] = a;

__syncthreads();

g_out[index] = s_data[threadIdx.x];

g_out[index+256] = s_data[threadIdx.x+256];

g_out[index+512] = s_data[threadIdx.x+512];

}

I can’t really understand why reading 256 and 512 bytes ahead helps… After the first __syncthreads we only modify 1 float3 and write it to shared mem and wait for other threads to do the same. After that we write 3 values to the global mem. But if s_data[x+256] was written by another thread (I don’t see why we need to write it back otherwise) shouldn’t it this another thread have written the value to g_out itself?

If other threads are not modifying s_data[x+256] why do we need to write it anyways? Or is it like prefetching for the next warps? I’m really confused.

Thanks!

Jamie_K · May 10, 2009, 12:41am

The key is that there are three floats in a float3 per thread. The x, y, and z elements are interleaved. If g_in were cast to a float3 pointer (call it “vin”) then the data would be organized like this:
vin[0].x, vin[0].y, vin[0].z, vin[1].x, vin[1].y, vin[1].z, vin[2].x, etc…

Loading vin[threadIdx.x].x (which would be simplest) would be uncoalesced and give bad performance.

So the entire block of blockDim*3 values is copied to shared memory first (blockDim has to be 256 for this code to work). Thread 0 copies vin[0].x, thread 1 copies vin[0].y, thread 2 copies vin[0].z, thread 3 copies vin[1].x, and so on, so it is coalesced. Since there are three times as many values as threads, each thread has to copy three values.

_DK · May 11, 2009, 3:27pm

Thanks Jamie!

What I actually missed was

float s_data

being cast to

(float3*)s_data

So I was thinking s_data was float3 and got confused.

Topic		Replies	Views
Memory access coalescing Vs. the compiler CUDA Programming and Performance	2	2496	July 23, 2007
Coalesced Access Is this coalesced? CUDA Programming and Performance	7	2659	June 17, 2008
want to know more detail of memory coalescing CUDA Programming and Performance	4	2075	November 11, 2008
Coalesced float3 code CUDA Programming and Performance	2	4221	July 23, 2008
Memory coalescing in one thread CUDA Programming and Performance	17	16829	March 31, 2011
Another question about coalesced reads/writes CUDA Programming and Performance	10	2281	August 18, 2009
Is these way coalesced access? CUDA Programming and Performance	0	422	March 6, 2020
Shared memory question CUDA Programming and Performance	27	7694	June 23, 2008
coalesced read to shared memory CUDA Programming and Performance	0	597	October 27, 2011
coalesced access to global memory CUDA Programming and Performance	6	1265	May 8, 2014

coalescing example

Related topics