Memory access coalescing vs. the compiler

Hi,

I need to copy 3D coordinates from global to shared memory.

My first try was:

float3 global[];  // in global memory
float3 shared[];  // in shared memory

shared[tid] = global[tid];

(tid being thread id)

This is not good because each thread reads three separate 32-bit values from global memory, so the accesses cannot be coalesced.

If I understand the programming guide correctly, I should use 128-bit types instead.

So I did:

float4 global[];
float3 shared[];

float4 temp = global[tid];
shared[tid].x = temp.x;
shared[tid].y = temp.y;
shared[tid].z = temp.z;

But if I look at the .ptx that is produced, the compiler does not issue a single 128-bit read for ‘temp’: it issues a 64-bit read for temp.x and temp.y and a 32-bit read for temp.z.

One way to force the compiler to issue a 128-bit read is to do:

float4 global[];
float4 shared[];

shared[tid] = global[tid];

But:

1/ it wastes shared memory (a float4 per element instead of a float3)

2/ it performs useless writes to shared memory (the unused .w components)

3/ the size of float4 is not good for avoiding bank conflicts (right? see my bank count below)
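My bank count for 3/, assuming 16 banks of 32-bit words: a float4 spans four words, so thread t of a half-warp starts at bank 4*t mod 16, and threads 0, 4, 8 and 12 all land on bank 0 (a 4-way conflict). A float3 stride of three words is coprime with 16, so 16 consecutive threads fall into 16 distinct banks.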

Right now my only option seems to be the second approach, and then fixing the .ptx file by hand.

Is there a better way?

Or is the .ptx going to be re-optimized later, so that a single 128-bit read is issued even though the compiler emitted a 64-bit plus a 32-bit read?

There are several ways to go about loading float3s with coalescing. One is to use shared memory (smem) as follows.

Treat the array in global memory, as well as the one in shared memory, as an array of floats, not float3s. When your kernel moves data from gmem to smem, each thread performs 3 reads of scalar floats. The 2nd read is (#threads/block) floats away from the first, and the 3rd is 2*(#threads/block) floats away from the first. Thus, each of the three reads is coalesced.

When processing the data inside the kernel, a thread can grab its piece of data by casting the smem array to float3 type. Compute code doesn’t change from that point.

Writing the result back to gmem uses the same approach as reading.

This may seem convoluted at first, but only the gmem access code changes; the rest stays the same. And the performance equals that of coalesced transfers (all the reads/writes are coalesced, after all). Below are the uncoalesced and coalesced code samples (the second one is hardcoded for 256 threads per block):

__global__ void accessKernelFloat3(float3 *d_in, float3 *d_out)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    // Each thread reads a 12-byte float3 as three separate 32-bit
    // accesses, so none of them coalesce
    float3 a = d_in[index];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    d_out[index] = a;
}

__global__ void accessKernelFloat3Shared(float *g_in, float *g_out)
{
    // Each block processes 256 float3s = 768 floats, so this block's
    // first float sits at 3*blockIdx.x*blockDim.x
    int index = 3*blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float s_data[256*3];

    // Three coalesced reads: consecutive threads read consecutive floats,
    // with each read offset by 256 floats (= blockDim.x) from the previous
    s_data[threadIdx.x]     = g_in[index];
    s_data[threadIdx.x+256] = g_in[index+256];
    s_data[threadIdx.x+512] = g_in[index+512];
    __syncthreads();

    // Each thread grabs its own element by casting the smem array to float3
    float3 a = ((float3*)s_data)[threadIdx.x];
    a.x += 2;
    a.y += 2;
    a.z += 2;
    ((float3*)s_data)[threadIdx.x] = a;
    __syncthreads();

    // Three coalesced writes, mirroring the reads
    g_out[index]     = s_data[threadIdx.x];
    g_out[index+256] = s_data[threadIdx.x+256];
    g_out[index+512] = s_data[threadIdx.x+512];
}
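For completeness, here is a minimal host-side launch sketch (my own code, not from the post above; N, d_in and d_out are made-up names, and N is assumed to be a multiple of 256 to match the hardcoded block size):

int main()
{
    const int N = 256 * 1024;  // number of float3 elements, assumed a multiple of 256
    float3 *d_in, *d_out;
    cudaMalloc((void**)&d_in, N * sizeof(float3));
    cudaMalloc((void**)&d_out, N * sizeof(float3));

    // Each block of 256 threads handles 256 float3s, so N/256 blocks cover the array.
    // The shared-memory kernel takes float* views of the same float3 buffers.
    accessKernelFloat3Shared<<<N / 256, 256>>>((float*)d_in, (float*)d_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

For a block size other than 256, the three hardcoded offsets simply become blockDim.x and 2*blockDim.x, as described above.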

Oh, OK.
Thank you!