performance for global and shared memory

Hi,

I am a bit confused by the description of global and shared memory performance. Global memory reads are fastest if they are 16-byte aligned and coalesced into a single contiguous block for the entire warp.

Now assume that each thread of a block loads a 16-byte-aligned structure into shared memory, and that each thread then performs calculations on the structure it has loaded. Access to shared memory will then be highly inefficient, because we get many bank conflicts.

For example, take a 16-byte structure such as float4: each element spans four 4-byte banks, so when the 16 threads of a half-warp each read their own element, every bank is accessed by 4 threads, i.e. a 4-way bank conflict!
To make it fast, would I have to add 4 bytes of padding per element in shared memory?

Would be happy about any comments! :)

Cheers
thw81

In data-parallel code it is usually best to use a Structure of Arrays (SoA) rather than Arrays of Structures (AoS).

So the following code results in tons of non-coalesced loads and shared-memory bank conflicts:

struct myStructAoS
{
    float x, y;
    unsigned char b;
    unsigned int count;
};

__global__ void foo(myStructAoS *g_data, float* g_results)
{
    __shared__ myStructAoS structures[NUM_THREADS];

    // load into shared memory
    structures[threadIdx.x] = g_data[threadIdx.x];
    __syncthreads();

    // some computation
    float result = 0;
    for (int i = 0; i < structures[threadIdx.x].count; i++)
    {
        if (structures[threadIdx.x].b > 4)
            result += structures[threadIdx.x+i].x + structures[threadIdx.x+i].y;
    }
    g_results[threadIdx.x] = result;
}

The following code has coalesced loads (except for the array of unsigned char, which could be fixed by padding each element to 4 bytes):

struct myStructSoA
{
    float x[NUM_THREADS];
    float y[NUM_THREADS];
    unsigned char b[NUM_THREADS];
    unsigned int count[NUM_THREADS];
};

__global__ void foo(myStructSoA *g_data, float* g_results)
{
    __shared__ myStructSoA structures;

    // load into shared memory, one field at a time
    structures.x[threadIdx.x]     = g_data->x[threadIdx.x];
    structures.y[threadIdx.x]     = g_data->y[threadIdx.x];
    structures.b[threadIdx.x]     = g_data->b[threadIdx.x];
    structures.count[threadIdx.x] = g_data->count[threadIdx.x];
    __syncthreads();

    // some computation
    float result = 0;
    for (int i = 0; i < structures.count[threadIdx.x]; i++)
    {
        if (structures.b[threadIdx.x] > 4)
            result += structures.x[threadIdx.x+i] + structures.y[threadIdx.x+i];
    }
    g_results[threadIdx.x] = result;
}

HTH,

Mark

NVIDIA should put this advice in HUGE letters in the CUDA manual. I spent a long time figuring this out.