I don’t understand why there are bank conflicts if the shared memory size is BLOCK_DIM*BLOCK_DIM instead of (BLOCK_DIM+1)*BLOCK_DIM in the transpose example distributed with the CUDA SDK. Shouldn’t each thread be accessing a different bank, so the shared memory doesn’t need to be padded? I assume it has something to do with the “odata[index_out] = block[threadIdx.x][threadIdx.y];” line, but I don’t quite see it. Thanks in advance.
Edit: Never mind, I think I get it. Shared memory addresses wrap around the 16 banks. So, supposing block were a 1D array, bank 0 would contain block[0], block[16], block[32]… etc.
If block were defined as “__shared__ float block[BLOCK_DIM][BLOCK_DIM];”, bank 0 would contain block[0][0], block[1][0], block[2][0], etc. A 16-way bank conflict then occurs when thread 0 reads block[0][0], thread 1 reads block[1][0], etc. in the first half-warp.
Defining block as “__shared__ float block[BLOCK_DIM][BLOCK_DIM+1];” puts block[0][0] in bank 0, block[1][0] in bank 1, block[2][0] in bank 2 … etc, so the bank conflicts are avoided.
Relevant code bit from transpose_kernel.cu
#define BLOCK_DIM 16

// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory. This kernel is up to 11x faster
// than the naive kernel below. Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM. This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}
You need to read the programming guide that comes with the SDK. It’s explained pretty well (and in detail) in that document. Without knowing the fundamentals of the architecture, it’s going to be tough for you to program pretty much anything in CUDA.
See CUDA Programming Guide 2.0, Section 5.1.2.5, Shared Memory.
Shared memory is memory that all the threads in a block can access; it is separate from global memory, which all threads in all blocks can access. Shared memory is divided into 16 banks, with addresses wrapping around the banks as described above. Bank conflicts occur when multiple threads in the same half-warp access the same bank, since all 16 threads of the half-warp issue their accesses at the same time.
Shared memory is 16 KB per multiprocessor, divided into 16 banks of 1 KB each. Successive 32-bit words belong to successive banks (e.g., word 18 belongs to bank 18 % 16 = 2). Each bank has a bandwidth of 32 bits per clock cycle, i.e., at any clock cycle a bank can serve only one word. So if we want to access two different words in the same bank, the accesses are serialized because of this bandwidth constraint; this is called a bank conflict. There are 16 banks and the 16 threads of a half-warp access shared memory in the same clock cycle, so a bank conflict can only occur among the threads of one half-warp; it can’t occur between two different half-warps.