Memory access coalescing vs. the compiler

Hi,

I need to copy 3D coordinates from global to shared memory.

My first try was:

float3 global[];  // in global memory
float3 shared[];  // in shared memory

shared[tid] = global[tid];

(tid being thread id)

This is not good because each thread reads three separate 32-bit values from global memory, so the accesses cannot be coalesced.

If I understand the programming guide correctly, I should use 128-bit types instead.

So I did:

float4 global[];
float3 shared[];

float4 temp = global[tid];
shared[tid].x = temp.x;
shared[tid].y = temp.y;
shared[tid].z = temp.z;

But if I look at the .ptx that is produced, the compiler does not issue a single 128-bit read for ‘temp’: it issues a 64-bit read for temp.x and temp.y and a 32-bit read for temp.z.

One way to force the compiler to issue a 128-bit read is to do:

float4 global[];
float4 shared[];

shared[tid] = global[tid];

But:

1/ it wastes shared memory (a float4 per element instead of a float3)

2/ it performs useless writes to shared memory (the unused .w components)

3/ the size of float4 is not good for avoiding bank conflicts (right? see my bank count below)
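My bank count for 3/, assuming 16 banks of 32-bit words: a float4 spans four words, so thread t of a half-warp starts at bank 4*t mod 16, and threads 0, 4, 8 and 12 all land on bank 0 (a 4-way conflict). A float3 stride of three words is coprime with 16, so 16 consecutive threads fall into 16 distinct banks.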

Right now my only option seems to be the second approach, and then fixing the .ptx file by hand.

Is there a better way?

Or is the .ptx going to be re-optimized later, so that a single 128-bit read is issued even though the compiler emitted a 64-bit plus a 32-bit read?

There are several ways to go about loading float3s with coalescing. One is to use shared memory (smem) as follows.

Treat the array in global memory, as well as the one in shared memory, as an array of floats, not float3s. When your kernel moves data from gmem to smem, each thread performs 3 reads of scalar floats. The 2nd read is (#threads/block) floats away from the first, and the 3rd is 2*(#threads/block) floats away from the first. Thus, each of the three reads is coalesced.

When processing the data inside the kernel, a thread can grab its piece of data by casting the smem array to float3 type. Compute code doesn’t change from that point.

Writing the result back to gmem uses the same approach as reading.

This may seem convoluted at first, but only the gmem access code changes; the rest stays the same. And the performance equals that of coalesced transfers (all the reads/writes are coalesced, after all). Below are the uncoalesced and coalesced code samples (the second one is hardcoded for 256 threads per block):

__global__ void accessKernelFloat3(float3 *d_in, float3 *d_out)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    // Each thread reads a 12-byte float3 as three separate 32-bit
    // accesses, so none of them coalesce
    float3 a = d_in[index];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    d_out[index] = a;
}

__global__ void accessKernelFloat3Shared(float *g_in, float *g_out)
{
    // Each block processes 256 float3s = 768 floats, so this block's
    // first float sits at 3*blockIdx.x*blockDim.x
    int index = 3*blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float s_data[256*3];

    // Three coalesced reads: consecutive threads read consecutive floats,
    // with each read offset by 256 floats (= blockDim.x) from the previous
    s_data[threadIdx.x]     = g_in[index];
    s_data[threadIdx.x+256] = g_in[index+256];
    s_data[threadIdx.x+512] = g_in[index+512];
    __syncthreads();

    // Each thread grabs its own element by casting the smem array to float3
    float3 a = ((float3*)s_data)[threadIdx.x];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    ((float3*)s_data)[threadIdx.x] = a;
    __syncthreads();

    // Three coalesced writes, mirroring the reads
    g_out[index]     = s_data[threadIdx.x];
    g_out[index+256] = s_data[threadIdx.x+256];
    g_out[index+512] = s_data[threadIdx.x+512];
}
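For completeness, here is a minimal host-side launch sketch (my own code, not from the post above; N, d_in and d_out are made-up names, and N is assumed to be a multiple of 256 to match the hardcoded block size):

int main()
{
    const int N = 256 * 1024;  // number of float3 elements, assumed a multiple of 256
    float3 *d_in, *d_out;
    cudaMalloc((void**)&d_in, N * sizeof(float3));
    cudaMalloc((void**)&d_out, N * sizeof(float3));

    // Each block of 256 threads handles 256 float3s, so N/256 blocks cover the array.
    // The shared-memory kernel takes float* views of the same float3 buffers.
    accessKernelFloat3Shared<<<N / 256, 256>>>((float*)d_in, (float*)d_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

For a block size other than 256, the three hardcoded offsets simply become blockDim.x and 2*blockDim.x, as described above.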

Oh, OK.
Thank you!