Having read the NVIDIA guides and other material on the topic of shared memory, there is still one thing that I can’t understand.
[codebox]extern __shared__ int sdata[]; // assume data has already been copied into shared memory
{
    o_data[2*thid]     = sdata[2*thid];
    o_data[2*thid + 1] = sdata[2*thid + 1];
}[/codebox]
In this case each thread accesses 2 elements of shared memory and writes them to o_data. This is bound to have a bank conflict, as the first 8 threads would have accessed data from all 16 banks, and the 9th thread will access data from the first bank again.
My question is: is there any way the bank conflict can ever be removed in this and other cases where one thread accesses 2 elements, given that there is no bank left after the 8th thread and access always takes place in terms of half warps?
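To make the layout concrete, here is how I picture the first of those two accesses for one half warp (assuming 16 banks and 4-byte words, so bank = word address % 16):
[codebox]// o_data[2*thid] = sdata[2*thid];  for thid = 0..15 of a half warp
// thid: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
// addr: 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30
// bank: 0  2  4  6  8 10 12 14  0  2  4  6  8 10 12 14[/codebox]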
@laughingrice
But I’m still saying that whatever way you implement it, if one thread accesses 2 elements, there is bound to be a shortage of unused banks, since 8 threads at any time would have used all 16 banks and 16 threads have to run at a time.
Even if you add blockDim.x to the index as you mentioned, this will access data from some unused bank (and yes, coalesced if it is a multiple of 16) only up to the 8th thread. But what about the 9th thread? It will again have to access data from an already used bank (and all of them are used up to the 8th thread).
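Just so we are talking about the same thing, this is how I understand “add blockDim.x in the index” (a sketch of my reading of it, not your exact code):
[codebox]// each thread copies two elements, blockDim.x apart:
o_data[thid]              = sdata[thid];               // first access:  elements 0 .. blockDim.x-1
o_data[thid + blockDim.x] = sdata[thid + blockDim.x];  // second access: elements blockDim.x .. 2*blockDim.x-1[/codebox]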
Also, I’ve read the REDUCTIONS whitepaper given in the CUDA SDK. I could understand all of it except one thing: how the bank conflict was removed.
[codebox]for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    int index = 2 * s * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}[/codebox]
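My own working for the first iteration (s = 1, assuming a 32-thread block):
[codebox]// s = 1, index = 2*tid, and the condition index < blockDim.x keeps threads 0..15 active
// tid:               0  1  2  3  4  5  6  7  8 ... 15
// sdata[index]:      0  2  4  6  8 10 12 14 16 ... 30   -> banks 0 2 4 ... 14, then 0 2 4 ... 14 again
// sdata[index + s]:  1  3  5  7  9 11 13 15 17 ... 31   -> banks 1 3 5 ... 15, then 1 3 5 ... 15 again[/codebox]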
It was written that this code will suffer from bank conflicts, which is true. Here each thread accesses 2 elements from consecutive banks. The modification given to remove the bank conflict was:
[codebox]for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}[/codebox]
How was the bank conflict removed? This just made the access sequential. Each thread now accesses its own element and the element at a distance of blockDim.x/2 (the distance halving in further iterations). But the same thing applies here: the first 8 threads would have accessed all 16 banks, leaving no unused bank for the 9th thread.
Also, the reason cited for the removal of the bank conflict was that “sequential access is conflict free”, which is wrong, as the NVIDIA Programming Guide shows that even random access is conflict free as long as each bank is accessed by only one thread.
Bank conflicts are an issue of bank addressing, not of the number of shared memory accesses made by a thread.
Additionally, each shared memory access is handled by itself. You need to look at each of your transactions separately.
The bank conflicts here arise from the fact that you use only even addresses for the first access, so you have a two-way bank conflict (you are not using the odd addresses in that transaction), and then you do the same thing with the odd addresses.
What I proposed is to use concurrent addresses for both the first and the second transaction:
[codebox]if ((threadIdx.x & 1) == 0)
{
    o_data[threadIdx.x] = sdata[threadIdx.x];
}
__syncthreads(); // No bank conflicts, but only half of the data is copied. Figure out a way to do the full copy.[/codebox]
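For the first half warp that copy works out as follows:
[codebox]// active threads of the half warp (threadIdx.x even):  0  2  4  6  8 10 12 14
// shared memory address accessed:                       0  2  4  6  8 10 12 14
// bank (address % 16):                                  0  2  4  6  8 10 12 14
// -> 8 distinct banks, at most one active thread per bank, so no conflict[/codebox]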
Another solution would be to check Chapter 39 of GPU Gems – see the last part on how they avoid shared memory conflicts. It’s on the internet. Not sure if they use a similar technique or not, but it would be worth checking.
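If I remember right, the general idea there is to pad the shared memory index so that strided accesses no longer pile up in one bank. A rough sketch of the idea (my own, not necessarily their exact code):
[codebox]#define LOG_NUM_BANKS 4                      // 16 banks
#define PAD(i) ((i) >> LOG_NUM_BANKS)        // one extra pad word per 16 elements

__shared__ int s_pad[256 + PAD(256)];        // declare the array with room for the padding

// Without padding, a stride-16 access like s_pad[16 * tid] sends every thread of the
// half warp to bank 0 (a 16-way conflict).  With padding the physical index becomes
// 16*tid + tid = 17*tid, whose bank is tid % 16, so each thread gets its own bank.
int i = 16 * tid;
int x = s_pad[i + PAD(i)];[/codebox]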
Oh… Now I get you. So you mean that the two lines I wrote are two separate transactions altogether. In that case the transaction “o_data[2*thid]=sdata[2*thid];” will suffer a bank conflict, since I’m using only even addresses. True. OK. But what about this:
[codebox]for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] = sdata[tid] + sdata[tid + s];
    }
    __syncthreads();
}[/codebox]
Now suppose we have only 1 block of dimension 32, and all 32 elements are already loaded into shared memory using sdata[tid] = i_data[tid]. Successive 32-bit words (4 bytes) are stored in consecutive banks, so the data is laid out such that the first 16 elements sit consecutively in the 16 banks, the 17th element goes back to the first bank, the 18th to the second, … and the 32nd to the 16th. Now consider the first iteration, s = 16. When tid is 0, the 1st and the 17th element are both addressed by thread 0, and both lie in the first bank. So, CONFLICT. Similarly for tid = 1, 2, …, 15: there will be a 2-way conflict. But the whitepaper (REDUCTIONS.pdf in the SDK) says that this code is conflict free.
Also, the picture given along with this code in the PDF (Reduction Image) depicts the memory addressing correctly only when blockDim.x = 16.
First of all, a thread can’t have a bank conflict with itself, since each access is a transaction in its own right. For each specific shared memory access you need to check that the 16 concurrent threads of a half warp each access a different bank.
[codebox]sdata[tid] = sdata[tid] + sdata[tid + s];
^ transaction 1, concurrent addresses, no bank conflict
             ^ transaction 2, concurrent addresses, no bank conflict
                          ^ transaction 3, concurrent addresses, no bank conflict[/codebox]
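Spelling that out for your 32-thread example in the first iteration:
[codebox]// blockDim.x = 32, first iteration: s = 16, threads 0..15 of the half warp are active
// transaction 1, write sdata[tid]    : addresses  0..15 -> banks 0..15, one thread per bank
// transaction 2, read  sdata[tid]    : addresses  0..15 -> banks 0..15, one thread per bank
// transaction 3, read  sdata[tid + s]: addresses 16..31 -> banks 0..15, one thread per bank
//
// Thread 0 does touch bank 0 twice (addresses 0 and 16), but those two accesses belong to
// different transactions, so they never compete for the same bank within one transaction.[/codebox]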