Problem with bank conflict. Something wrong with my experiment?Confused!

Austin · February 25, 2009, 2:00am

Hi,all.

Now I take an experiment about share memory bank conflict.I launched four kernels,each for 100 times,the kernel code is as follows:

[codebox]

////////////////////////////////////////////////////

//* kernel 1: each thread deal with one 4-bytes word

//* without bank-conflict

///////////////////////////////////////////////////

global void

BankFlict_Test_kernel_1(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid * blockDim.x;

__shared__ unsigned int share[16];



share[tid] = d_in[baseIndex + tid];

share[tid] += 2;

d_out[baseIndex + tid] = share[tid];

}

///////////////////////////////////////////////////

//* kernel 2: each thread deal with one 1-bytes char

//* causes 4-way bank-conflicts

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_2(char *d_out, char *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid * blockDim.x;

__shared__ char share[16];

share[tid] = d_in[baseIndex + tid];

share[tid] += 2;



d_out[baseIndex + tid] = share[tid];

}

//////////////////////////////////////////////////

//* kernel 3: every thread in one block access the

//* the same address,ie. broadcast without

//* bank-conflict

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_3(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid*blockDim.x;



__shared__ unsigned int share[16];

share[tid] = d_in[bid];

d_out[baseIndex + tid] = share[0]+2;

}

//////////////////////////////////////////////////

//* kernel 4: every thread in one block access the

//* same address,but access many times

//* and the address is different

//////////////////////////////////////////////////

global void

BankFlict_Test_kernel_4(unsigned int *d_out,unsigned int *d_in)

{

const unsigned int tid = threadIdx.x;

const unsigned int bid = blockIdx.x;

const unsigned int baseIndex = bid*blockDim.x;

__shared__ unsigned int share[16];

share[tid] = d_in[baseIndex + tid];

for(int i = 0;i < blockDim.x; i++)

{

	d_out[baseIndex + tid] = share[i]+2;

	__syncthreads();

}

}

[/codebox]

The comment line shows what i think the kernel should be.The kernel was launched with <blockNUM,threadsForEachBlock>is<16,16>

But the result is very almost the same.

Kernel 1: processing time is: 0.409000 

Kernel 2: processing time is: 0.493000 

Kernel 3: processing time is: 0.476000 

Kernel 4: processing time is: 0.615000

why does this happen?Normally,The kernel 1 should be ~4x faster than kernel 2. Am i right?What about the kernel 4?It’s bank conflict-free?

I also run the kernel on the Visual Profiler, attched is the result.Any guys can make an explanation?THX.

seibert · February 25, 2009, 1:34pm

My guess is that the cycles wasted by the bank conflicts are completely hidden by the time it takes to read and write global memory (much, much slower) in this code. Incidentally, this is why it is best to ignore bank conflicts until you have nothing left to optimize. Other factors in your kernel might make bank conflicts irrelevant. :)

Austin · February 26, 2009, 1:30am

Yeah, it’s a reasonable explanation.Considering that the share memory would only take 1~2 clock cycles to access the memory while 400~600 clock cycles in the global memory.

Thanks for your advice, It’s helpful.

Another optimization I would want to make is make the access of global memory coalescence. But the coalescence requires so many restriction on the G80 series card ,and the situation has something enhancement.

The global memory access by all threads of a half-warp is coalesced into one or two

memory transactions if it satisfies the following three conditions:

Threads must access
1: Either 32-bit words, resulting in one 64-byte memory transaction,

2:Or 64-bit words, resulting in one 128-byte memory transaction,

3:Or 128-bit words, resulting in two 128-byte memory transactions;
All 16 words must lie in the same segment of size equal to the memory

transaction size (or twice the memory transaction size when accessing 128-bit

words);

Threads must access the words in sequence: The kth thread in the half-warp must

access the kth word.

To my understanding , one thread only access one 32- 64- 128bit word,and the access must be serial,also another restriction is the kth half-warp must access the global memory with the start index should be multiple of 64 bytes according to one 4bytes word for each thread. And what does 64-bit word means? A double type data?Or two integer datas?Am I right? :wacko:

Thanks for your attention.

seibert · February 26, 2009, 2:02am

Most often the larger data types in CUDA are in fact structs made of smaller ones. For example, copying a float2 or float4 from global memory will cause the compiler to issue a 64 or 128 bit load.

Austin · February 26, 2009, 2:34am

OH,i ignored this build-in data types. :">

Am i right?

Seibert,thanks for your reply.

Topic		Replies	Views
Shared memory access patterns CUDA Programming and Performance	2	1147	March 4, 2010
Help understanding bank conflicts in transpose example CUDA Programming and Performance	5	6757	February 8, 2009
Shared memory bank conflicts? CUDA Programming and Performance	0	858	June 4, 2009
Bank Conflict when each thread accesses 2 elements CUDA Programming and Performance	8	5664	July 9, 2010
char global memory access optimization CUDA Programming and Performance	17	11974	May 31, 2010
Shared memory avoiding bank conflict less effective CUDA Programming and Performance	3	3739	May 6, 2010
Bank Conflicts CUDA Programming and Performance	2	2008	December 6, 2009
result of bank conflict testing is weird! bank conflict CUDA Programming and Performance	4	5394	May 2, 2011
Will this code cause bank conflict ? CUDA Programming and Performance	1	468	October 9, 2018
Accessing Shared memory: Unexpected timing result "Naive" access seems to be twice as fast a CUDA Programming and Performance	3	1201	February 27, 2012

Problem with bank conflict. Something wrong with my experiment?Confused!

Related topics