Dear All,
I made a few speed measurements on a GeForce 8600M GT to find the quickest way to initialise a char array in shared memory.
(This card does not support double precision, so the widest natural cast width is 4 bytes, not 8. Using the built-in vector types of up to 16 bytes apparently does not speed the code up either.)
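For illustration, what I mean by a 16-byte vector-type store is roughly the following sketch (`initialise_vec` is just an illustrative name, not the exact code I timed):

```cuda
// Sketch: zeroing shared memory with 16-byte int4 stores,
// i.e. one int4 write per iteration instead of four separate int writes.
__global__ void initialise_vec(char *tests)
{
    __shared__ char storage[16 * 256 * 3];
    unsigned int mydeal16 = 16 * 256 * 3 / (256 * 16); // int4's per thread (3)
    for (unsigned int j = 0; j < mydeal16; j++) {
        reinterpret_cast<int4 *>(storage)[threadIdx.x * mydeal16 + j] =
            make_int4(0, 0, 0, 0);
    }
}
```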
__global__ void initialise(char *tests)
{
    __shared__ union {
        char storage[16 * 256 * 3]; // naive: flat array
        char stor2[256 * 3][16];    // bank-conflict free
        char stor3[16][256 * 3];    // bank conflicts!
    } stor;

    unsigned int mydeal  = 16 * 256 * 3 / 256;       // my share of the deal, in bytes
    unsigned int mydeal4 = 16 * 256 * 3 / (256 * 4); // my share of the deal, in ints
    unsigned int mydeal8 = 16 * 256 * 3 / (256 * 8); // my share of the deal, in long longs
    unsigned int BankID   = threadIdx.x % 16;        // bank this thread maps to
    unsigned int IDinBank = threadIdx.x / 16;        // this thread's slot within the bank

    for (int i = 0; i < 10000; i++) {
        // OPTION 1: bank-conflict struck, blocky.
        // TWICE AS FAST AS BANK-CONFLICT-FREE SCHEDULING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int *>(&stor.storage[threadIdx.x * mydeal4 + j]) = 0;
            // tests[threadIdx.x * mydeal + j] = stor.storage[threadIdx.x * mydeal4 + j]; // optional write-back check
        }

        /* // OPTION 2: BANK-CONFLICT-FREE MAPPING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int *>(&stor.stor2[BankID][IDinBank * mydeal4 + j]) = 0;
        } */

        /* // OPTION 3: bank-conflict affected.
        // TWICE AS SLOW AS BANK-CONFLICT-FREE SCHEDULING!
        for (int j = 0; j < mydeal4; j += 4) {
            *reinterpret_cast<int *>(&stor.stor3[BankID][IDinBank * mydeal4 + j]) = 0;
        } */

        /* // OPTION 4: bank-conflict struck, very blocky (8-byte stores).
        // Does not further speed up the code.
        for (int j = 0; j < mydeal8; j += 8) {
            *reinterpret_cast<longlong1 *>(&stor.storage[threadIdx.x * mydeal8 + j]) = make_longlong1(0);
        } */
    }
}
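For completeness, I time the kernel on the host roughly as follows (a sketch: one block of 256 threads, which the per-thread share calculations above assume; `d_tests` is just my name for the device buffer):

```cuda
// Host-side timing sketch using CUDA events.
#include <cstdio>

int main()
{
    char *d_tests;
    cudaMalloc(&d_tests, 16 * 256 * 3);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    initialise<<<1, 256>>>(d_tests);  // one block of 256 threads
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_tests);
    return 0;
}
```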
Interestingly, the first, “naive” option above is about TWICE as fast as the bank-conflict-free memory layout.
Does anyone know why that is?
Thanks and Regards
Christian