Having read the NVIDIA guides and other material on the topic of shared memory, there is still one thing that I can’t understand.
[codebox]extern __shared__ int sdata[]; // assume data has already been copied into shared memory
{
    o_data[2*thid]     = sdata[2*thid];
    o_data[2*thid + 1] = sdata[2*thid + 1];
}[/codebox]
In this case each thread accesses 2 elements of shared memory and writes them to o_data. This is bound to have a bank conflict, as the first 8 threads would have accessed data from all 16 banks, and the 9th thread will access data from the first bank again.
My question is: is there any way the bank conflict can ever be removed in this and other cases where one thread accesses 2 elements, given that there is no bank left after the 8th thread and access always takes place in terms of half warps?
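To make the layout concrete, here is how I picture the first of those two accesses for one half warp (assuming 16 banks and 4-byte words, so bank = word address % 16):
[codebox]// o_data[2*thid] = sdata[2*thid];  for thid = 0..15 of a half warp
// thid: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
// addr: 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
// bank: 0  2  4  6  8 10 12 14  0  2  4  6  8 10 12 14[/codebox]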
@laughingrice
But I’m still saying that whatever way you implement it, if one thread accesses 2 elements, there is bound to be a shortage of unused banks, since 8 threads at any time would have used all 16 banks and 16 threads have to run at a time.
Even if you add blockDim.x to the index as you mentioned, this will access data from some unused bank (and yes, coalesced if it is a multiple of 16) only up to the 8th thread. But what about the 9th thread? It will again have to access data from an already used bank (and all of them are used up to the 8th thread).
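Just so we are talking about the same thing, this is how I understand “add blockDim.x in the index” (a sketch of my reading of it, not your exact code):
[codebox]// each thread copies two elements, blockDim.x apart:
o_data[thid]              = sdata[thid];               // first access:  elements 0 .. blockDim.x-1
o_data[thid + blockDim.x] = sdata[thid + blockDim.x];  // second access: elements blockDim.x .. 2*blockDim.x-1[/codebox]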
Also, I’ve read the REDUCTIONS whitepaper given in the CUDA SDK. I could understand all of it except one thing: how the bank conflict was removed.
[codebox]for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}[/codebox]
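My own working for the first iteration (s = 1, assuming a 32-thread block):
[codebox]// s = 1, index = 2*tid, and the condition index < blockDim.x keeps threads 0..15 active
// tid:               0  1  2  3  4  5  6  7  8 ... 15
// sdata[index]:      0  2  4  6  8 10 12 14 16 ... 30   -> banks 0 2 4 ... 14, then 0 2 4 ... 14 again
// sdata[index + s]:  1  3  5  7  9 11 13 15 17 ... 31   -> banks 1 3 5 ... 15, then 1 3 5 ... 15 again[/codebox]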
It was written that this code will suffer from bank conflicts, which is true. Here each thread accesses 2 elements from consecutive banks. The modification given to remove the bank conflict was:
[codebox]for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}[/codebox]
How was the bank conflict removed? This just made the access sequential. Each thread now accesses its own element and the element at a distance of blockDim.x/2 (the distance halving in further iterations). But the same thing applies here: the first 8 threads would have accessed all 16 banks, leaving no unused bank for the 9th thread.
Also, the reason cited for the removal of the bank conflict was that “sequential access is conflict free”, which is wrong, as the NVIDIA Programming Guide shows that even random access is conflict free as long as each bank is accessed by only one thread.
Bank conflicts are an issue of bank addressing, not of the number of shared memory accesses made by a thread.
Additionally, each shared memory access is handled by itself. You need to look at each of your transactions separately.
The bank conflicts here arise from the fact that you use only even addresses for the first access, so you have a two-way bank conflict (you are not using the odd addresses in that transaction), and then you do the same thing with the odd addresses.
What I proposed is to use concurrent addresses for both the first and the second transaction:
[codebox]if ((threadIdx.x & 1) == 0)
{
    o_data[threadIdx.x] = sdata[threadIdx.x];
}
__syncthreads(); // No bank conflicts, but only half of the data is copied. Figure out a way to do the full copy.[/codebox]
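For the first half warp that copy works out as follows:
[codebox]// active threads of the half warp (threadIdx.x even):  0  2  4  6  8 10 12 14
// shared memory address accessed:                       0  2  4  6  8 10 12 14
// bank (address % 16):                                  0  2  4  6  8 10 12 14
// -> 8 distinct banks, at most one active thread per bank, so no conflict[/codebox]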
Another solution would be to check Chapter 39 of GPU Gems – see the last part on how they avoid shared memory conflicts. It’s on the internet. Not sure if they use a similar technique or not, but it would be worth checking.
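If I remember right, the general idea there is to pad the shared memory index so that strided accesses no longer pile up in one bank. A rough sketch of the idea (my own, not necessarily their exact code):
[codebox]#define LOG_NUM_BANKS 4                      // 16 banks
#define PAD(i) ((i) >> LOG_NUM_BANKS)        // one extra pad word per 16 elements

__shared__ int s_pad[256 + PAD(256)];        // declare the array with room for the padding

// Without padding, a stride-16 access like s_pad[16 * tid] sends every thread of the
// half warp to bank 0 (a 16-way conflict).  With padding the physical index becomes
// 16*tid + tid = 17*tid, whose bank is tid % 16, so each thread gets its own bank.
int i = 16 * tid;
int x = s_pad[i + PAD(i)];[/codebox]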
Oh… Now I get you. So you mean that the two lines I wrote are two separate transactions altogether. In that case the transaction “o_data[2*thid]=sdata[2*thid];” will suffer a bank conflict, since I’m using only even addresses. True. OK. But what about this:
[codebox]for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] = sdata[tid] + sdata[tid + s];
    }
    __syncthreads();
}[/codebox]
Now suppose we have only 1 block of dimension 32, and all 32 elements are already loaded into shared memory using sdata[tid] = i_data[tid]. Successive 32-bit words (4 bytes) are stored in consecutive banks, so the data is laid out such that the first 16 elements sit consecutively in the 16 banks, the 17th element goes back to the first bank, the 18th to the second, … and the 32nd to the 16th. Now consider the first iteration, s = 16. When tid is 0, the 1st and the 17th element are both addressed by thread 0, and both lie in the first bank. So, CONFLICT. Similarly for tid = 1, 2, …, 15: there will be a 2-way conflict. But the whitepaper (REDUCTIONS.pdf in the SDK) says that this code is conflict free.
Also, the picture given along with this code in the PDF (Reduction Image) depicts the memory addressing correctly only when blockDim.x = 16.
First of all, a thread can’t have a bank conflict with itself, since each access is a transaction in its own right. For each specific shared memory access you need to check that the 16 concurrent threads of a half warp each access a different bank.
[codebox]sdata[tid] = sdata[tid] + sdata[tid + s];
^ transaction 1, concurrent addresses, no bank conflict
             ^ transaction 2, concurrent addresses, no bank conflict
                          ^ transaction 3, concurrent addresses, no bank conflict[/codebox]
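Spelling that out for your 32-thread example in the first iteration:
[codebox]// blockDim.x = 32, first iteration: s = 16, threads 0..15 of the half warp are active
// transaction 1, write sdata[tid]    : addresses  0..15 -> banks 0..15, one thread per bank
// transaction 2, read  sdata[tid]    : addresses  0..15 -> banks 0..15, one thread per bank
// transaction 3, read  sdata[tid + s]: addresses 16..31 -> banks 0..15, one thread per bank
//
// Thread 0 does touch bank 0 twice (addresses 0 and 16), but those two accesses belong to
// different transactions, so they never compete for the same bank within one transaction.[/codebox]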