Problem with bank conflict. Something wrong with my experiment?Confused!

Hi,all.

Now I take an experiment about share memory bank conflict.I launched four kernels,each for 100 times,the kernel code is as follows:

[codebox]

////////////////////////////////////////////////////

//* kernel 1: each thread deal with one 4-bytes word

//* without bank-conflict

///////////////////////////////////////////////////

global void

BankFlict_Test_kernel_1(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid * blockDim.x;

__shared__ unsigned int share[16];



share[tid] = d_in[baseIndex + tid];

share[tid] += 2;

d_out[baseIndex + tid] = share[tid];	

}

///////////////////////////////////////////////////

//* kernel 2: each thread deal with one 1-bytes char

//* causes 4-way bank-conflicts

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_2(char *d_out, char *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid * blockDim.x;

__shared__ char share[16];

share[tid] = d_in[baseIndex + tid];

share[tid] += 2;



d_out[baseIndex + tid] = share[tid];

}

//////////////////////////////////////////////////

//* kernel 3: every thread in one block access the

//* the same address,ie. broadcast without

//* bank-conflict

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_3(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid*blockDim.x;



__shared__ unsigned int share[16];

share[tid] = d_in[bid];

d_out[baseIndex + tid] = share[0]+2;

}

//////////////////////////////////////////////////

//* kernel 4: every thread in one block access the

//* same address,but access many times

//* and the address is different

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_4(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid*blockDim.x;

__shared__ unsigned int share[16];

share[tid] = d_in[baseIndex + tid];

for(int i = 0;i < blockDim.x; i++)

{

	d_out[baseIndex + tid] = share[i]+2;

	__syncthreads();

}

}

[/codebox]

The comment line shows what i think the kernel should be.The kernel was launched with <blockNUM,threadsForEachBlock>is<16,16>

But the result is very almost the same.

Kernel 1: processing time is: 0.409000 

Kernel 2: processing time is: 0.493000 

Kernel 3: processing time is: 0.476000 

Kernel 4: processing time is: 0.615000

why does this happen?Normally,The kernel 1 should be ~4x faster than kernel 2. Am i right?What about the kernel 4?It’s bank conflict-free?

I also run the kernel on the Visual Profiler, attched is the result.Any guys can make an explanation?THX.
Screenshot.png

My guess is that the cycles wasted by the bank conflicts are completely hidden by the time it takes to read and write global memory (much, much slower) in this code. Incidentally, this is why it is best to ignore bank conflicts until you have nothing left to optimize. Other factors in your kernel might make bank conflicts irrelevant. :)

Yeah, it’s a reasonable explanation.Considering that the share memory would only take 1~2 clock cycles to access the memory while 400~600 clock cycles in the global memory.

Thanks for your advice, It’s helpful.

Another optimization I would want to make is make the access of global memory coalescence. But the coalescence requires so many restriction on the G80 series card ,and the situation has something enhancement.

To my understanding , one thread only access one 32- 64- 128bit word,and the access must be serial,also another restriction is the kth half-warp must access the global memory with the start index should be multiple of 64 bytes according to one 4bytes word for each thread. And what does 64-bit word means? A double type data?Or two integer datas?Am I right? :wacko:

Thanks for your attention.

Most often the larger data types in CUDA are in fact structs made of smaller ones. For example, copying a float2 or float4 from global memory will cause the compiler to issue a 64 or 128 bit load.

OH,i ignored this build-in data types. :">

Am i right?

Seibert,thanks for your reply.