Accessing shared memory: unexpected timing result ("naive" access seems to be twice as fast as bank-conflict-free access)

Dear All,

I made a few speed measurements on a GeForce 8600M GT in order to find the quickest way to initialise a char-array in shared memory.

(This card does not support double precision, so the cast type is 4 bytes long, not 8. Choosing built-in vector types of up to 16 bytes apparently does not speed up the code either.)

__global__ void initialise(char *tests)
{
    __shared__ union {
        char storage[16 * 256 * 3]; // Naive
        char stor2[256 * 3][16];    // "Bank conflict free"
        char stor3[16][256 * 3];    // Bank conflicts!
    } stor;

    unsigned int mydeal  = 16 * 256 * 3 / 256;       // my share of the deal (bytes)
    unsigned int mydeal4 = 16 * 256 * 3 / (256 * 4); // my share of the deal (ints)
    unsigned int mydeal8 = 16 * 256 * 3 / (256 * 8); // my share of the deal (long longs)

    unsigned int BankID   = threadIdx.x % 16;
    unsigned int IDinBank = (threadIdx.x - BankID) / 16;

    for (int i = 0; i < 10000; i++) {

        // OPTION 1: bank-conflict struck, blocky
        for (int j = 0; j < mydeal4; j += 4) { // TWICE AS FAST AS BANK-CONFLICT-FREE SCHEDULING!
            *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }

        /*// OPTION 2: bank-conflict free // BANK-CONFLICT-FREE MAPPING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int*>(&stor.stor2[BankID][IDinBank * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/

        /*// OPTION 3: bank-conflict affected // TWICE AS SLOW AS BANK-CONFLICT-FREE SCHEDULING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int*>(&stor.stor3[BankID][IDinBank * mydeal4 + j]) = 0;
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/

        /*// OPTION 4: bank-conflict struck, very blocky // Does not further speed up the code.
        for (int j = 0; j < mydeal8; j += 8) {
            *reinterpret_cast<longlong1*>(&stor.storage[threadIdx.x * mydeal8 + j]) = make_longlong1(0);
            //tests[threadIdx.x * mydeal + j] = *reinterpret_cast<int*>(&stor.storage[threadIdx.x * mydeal4 + j]);
        }*/
    }
}

Interestingly, the first "naive" option above is about TWICE as fast as the bank-conflict-free memory management.

Does anyone know why that is?

Thanks and Regards

Christian

The "bank-conflict free" case isn't free of bank conflicts at all, for two reasons:

1. BankID needs to be the second array index, not the first.

2. Banks are 32 bits / 4 bytes wide, so the layout for the bank-conflict-free case would need to be

char stor2[64 * 3][64]; // bank conflict free

or

int stor2[64 * 3][16]; // bank conflict free

Why do you think "Option 2" is bank-conflict free? Threads 0 and 4 will access the same bank simultaneously.
Sorry, I didn't refresh the page, so I missed the answer from 'tera'.

Thanks, and sorry, yes that makes sense.
I think option 2 would be bank-conflict free on the Fermi architecture with its 32 banks (EDIT: and with swapped indices, argh!).
Cheers :)