Accessing shared memory: unexpected timing result ("naive" access seems to be twice as fast as bank-conflict-free access)

Dear All,

I made a few speed measurements on a GeForce 8600M GT in order to find the quickest way to initialise a char-array in shared memory.

(This card does not support double precision, so the cast type is 4 bytes long, not 8. Choosing built-in vector types of up to 16 bytes apparently does not speed up the code either.)

__global__ void initialise(char *tests)
{
    __shared__ union {
        char storage[16 * 256 * 3]; // Naive
        char stor2[256 * 3][16];    // "Bank conflict free"
        char stor3[16][256 * 3];    // Bank conflicts!
    } stor;

    unsigned int mydeal  = 16 * 256 * 3 / 256;       // my share of the deal (bytes)
    unsigned int mydeal4 = 16 * 256 * 3 / (256 * 4); // my share of the deal (ints)
    unsigned int mydeal8 = 16 * 256 * 3 / (256 * 8); // my share of the deal (long longs)

    unsigned int BankID   = threadIdx.x % 16;
    unsigned int IDinBank = (threadIdx.x - BankID) / 16;

    for (int i = 0; i < 10000; i++) {

        // OPTION 1: bank-conflict struck, blocky
        for (int j = 0; j < mydeal4; j += 4) { // TWICE AS FAST AS BANK-CONFLICT-FREE SCHEDULING!
            *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }

        /*// OPTION 2: bank-conflict free // BANK-CONFLICT-FREE MAPPING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int*>(&stor.stor2[BankID][IDinBank * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/

        /*// OPTION 3: bank-conflict affected // TWICE AS SLOW AS BANK-CONFLICT-FREE SCHEDULING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int*>(&stor.stor3[BankID][IDinBank * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/

        /*// OPTION 4: bank-conflict struck, very blocky // Does not further speed up the code.
        for (int j = 0; j < mydeal8; j += 8) {
            *reinterpret_cast<longlong1*>(&stor.storage[threadIdx.x * mydeal8 + j]) = make_longlong1(0);
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/
    }
}

Interestingly, the first "naive" option above is about TWICE as fast as the bank-conflict-free memory management.

Does anyone know why that is?

Thanks and Regards

Christian

The "bank-conflict free" case isn't free of bank conflicts at all, for two reasons:

1. BankID needs to be the second array index, not the first.

2. Banks are 32 bits / 4 bytes wide, so the layout for the bank-conflict-free case would need to be

char stor2[64 * 3][64]; // bank conflict free

or

int stor2[64 * 3][16]; // bank conflict free

Why do you think "Option 2" is bank-conflict free? Threads 0 and 4 will access the same bank simultaneously.
Sorry, I didn't refresh the page, so I missed the answer from 'tera'.

Thanks, and sorry, yes that makes sense.
I think option 2 would be bank-conflict free on the Fermi architecture with its 32 banks (EDIT: and with swapped indices, argh!).
Cheers :)