I don’t understand why there are bank conflicts if the shared memory size is BLOCK_DIM*BLOCK_DIM instead of (BLOCK_DIM+1)*BLOCK_DIM in the transpose example distributed with the CUDA SDK. Shouldn’t each thread be accessing a different bank, so the shared memory doesn’t need to be padded? I assume it has something to do with the “odata[index_out] = block[threadIdx.x][threadIdx.y];” line, but I don’t quite see it. Thanks in advance.
Edit: Never mind, I think I get it. Shared memory addresses wrap around the 16 banks. So, supposing block were a 1D array, bank 0 would contain block[0], block[16], block[32]… etc.
If block were defined as “__shared__ float block[BLOCK_DIM][BLOCK_DIM];”, bank 0 would contain block[0][0], block[1][0], block[2][0], etc. A 16-way bank conflict then occurs when thread 0 reads block[0][0], thread 1 reads block[1][0], etc. in the first half-warp.
Defining block as “__shared__ float block[BLOCK_DIM][BLOCK_DIM+1];” puts block[0][0] in bank 0, block[1][0] in bank 1, block[2][0] in bank 2 … etc, so the bank conflicts are avoided.
Relevant code bit from transpose_kernel.cu
#define BLOCK_DIM 16

// This kernel is optimized to ensure all global reads and writes are coalesced,
// and to avoid bank conflicts in shared memory. This kernel is up to 11x faster
// than the naive kernel below. Note that the shared memory array is sized to
// (BLOCK_DIM+1)*BLOCK_DIM. This pads each row of the 2D block in shared memory
// so that bank conflicts do not occur when threads address the array column-wise.
__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}
You need to read the programming guide that comes with the SDK. It’s explained pretty well (and in detail) in that document. Without knowing the fundamentals of the architecture, it’s going to be tough for you to program pretty much anything in CUDA.
See CUDA Programming Guide 2.0, Section 5.1.2.5, Shared Memory.
Shared memory is memory that all the threads in a block can access; it is separate from global memory, which all threads in all blocks can access. Shared memory is divided into 16 banks, with addresses wrapping around the banks as described above. Bank conflicts occur when multiple threads in the same half-warp access the same bank, since all 16 threads of the half-warp issue their accesses at the same time.
Shared memory is 16 KB per multiprocessor, divided into 16 banks of 1 KB each. Successive 32-bit words belong to successive banks (e.g., word 18 belongs to bank 18 % 16 = 2). Each bank has a bandwidth of 32 bits per clock cycle, i.e., at any clock cycle a bank can serve only one word. So if we want to access two different words in the same bank, the accesses are serialized because of this bandwidth constraint; this is called a bank conflict. There are 16 banks and the 16 threads of a half-warp access shared memory in the same clock cycle, so a bank conflict can only occur among the threads of one half-warp; it can’t occur between two different half-warps.