performance for global and shared memory

thw81 · March 17, 2007, 1:07pm

Hi,

I am a bit confused by the global and shared memory performance description. Global memory reads are fastest if they are 16 bytes aligned and coalesced into one single contiguous block for the entire warp.

Now assume that each thread of a block loads a 16 byte aligned structure into shared memory. Further assume each thread performs calculations on the structure that it has loaded. Access to shared memory will then be highly inefficient as we have many bank conflicts.

For example using a 16 byte structure (e.g. float4). 16 threads of a warp will access the shared memory, and thus 4 x 4 threads will access the same bank!
To get it fast I would have to do a 4 byte padding within shared memory?

Would be happy about any comments! :)

Cheers
thw81

Mark_Harris · March 22, 2007, 11:43am

In data-parallel code it is usually best to use a Structure of Arrays (SoA) rather than Arrays of Structures (AoS).

So the following code results in tons of non-coalesced loads and shared memory bank conflicts.

myStructAoS

{

    float x, y;

    unsigned char b;

    unsigned int count;

};

__global__ void foo(myStructAoS *g_data, float* g_results)

{

    __shared__ myStructAoS structures[NUM_THREADS];

    

    // load into shared memory

    structures[threadIdx.x] = g_data[threadIdx.x];

    __syncthreads();

   // some computation

    float result = 0;

   for (int i = 0; i < structures[threadIdx.x].count; i++)

    {

        if (structures[threadIdx.x].b > 4)

            result += structures[threadIdx.x+i].x + structures[threadIdx.x+i].y;

    }

   g_results[threadIdx.x] = result;

}

The following code has coalesced loads (except for the array of unsigned char, which could be fixed by padding).

myStructSoA 

{

     float x[NUM_THREADS];

     float y[NUM_THREADS];

     unsigned char b[NUM_THREADS];

     unsigned int count[NUM_THREADS];

}

__global__ void foo(myStructSoA *g_data)

{

    __shared__ myStructSoA structures;

    

    // load into shared memory

    structures.x[threadIdx.x]       = g_data.x[threadIdx.x];

    structures.y[threadIdx.x]       = g_data.x[threadIdx.x];

    structures.b[threadIdx.x]       = g_data.x[threadIdx.x];

    structures.count[threadIdx.x] = g_data.x[threadIdx.x];

   // some computation

    float result = 0;

   for (int i = 0; i < structures.count[threadIdx.x]; i++)

    {

        if (structures.b[threadIdx.x] > 4)

            result += structures.x[threadIdx.x+i] + structures.y[threadIdx.x+i];

    }

   g_results[threadIdx.x] = result;

}

HTH,

Mark

Dmitriy_Dedkov · January 15, 2008, 9:06pm

NVIDIA should put this advice in HUGE letters in CUDA manual, I spent a long time figuring this out.

Topic		Replies	Views
Data load question CUDA Programming and Performance	3	45	December 18, 2024
about shared memory's contribution to performance when global memory access is coalesced CUDA Programming and Performance	0	601	July 12, 2011
Non coalesced read/write in global vs shared CUDA Programming and Performance	12	4425	May 12, 2015
about shared memory's contribution to performance when global memory access is coalesced CUDA Programming and Performance	3	3525	July 12, 2011
Local vs Shared Memory execution slows down when using shared memory CUDA Programming and Performance	6	3203	October 14, 2009
Coalescing into shared memory CUDA Programming and Performance	1	1976	December 13, 2008
Shared memory question CUDA Programming and Performance	27	7370	June 23, 2008
Question regarding transfer from global to shared memory CUDA Programming and Performance	5	5985	November 27, 2010
Loading structured data efficiently using CUDA can this be right? CUDA Programming and Performance	8	24314	November 9, 2009
Coalescing memory accesses Need help with coalescing CUDA Programming and Performance	2	1171	March 30, 2009

performance for global and shared memory

Related topics