I would like to know the size of each bank and how many banks there are.
In particular, I want to store nThreads*sizeof(float4) bytes in shared memory.
Each thread would access one float4 value: thread 0 accesses the 0th float4, thread 1 accesses the 1st float4, and so on.
I read through the Programming Guide but could not figure out how to reduce bank conflicts. Is padding a way to do it?
And do bank conflicts occur when threads in a half warp access the same bank, or only when all 32 threads in a warp access the same bank?
Please help.
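To make the access pattern concrete, here is a minimal sketch of what I have in mind (the kernel and variable names are just illustrative, not real code I am using), with the dynamic shared array sized nThreads*sizeof(float4) bytes at launch:

#include <cuda_runtime.h>

__global__ void useSharedFloat4(const float4 *in, float4 *out)
{
    extern __shared__ float4 tile[];   // nThreads float4 values

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];               // stage one float4 per thread
    __syncthreads();

    out[gid] = tile[tid];              // thread i reads the i-th float4
}

// Launched, for example, as:
//   useSharedFloat4<<<gridSize, nThreads, nThreads * sizeof(float4)>>>(in, out);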
When copying data to, from, or between shared memory buffers, I use the following simple function to avoid bank conflicts:
template <typename T>
__device__ void memCopy(T *destination, const T *source, int size) {
    // Reinterpret both arrays as 32-bit words so that consecutive threads
    // access consecutive words, i.e. consecutive banks.
    int *dest = (int *)destination;
    const int *src = (const int *)source;
    // size is the number of T elements; each thread strides by blockDim.x words.
    for (int tid = threadIdx.x; tid < size * sizeof(T) / 4; tid += blockDim.x)
        dest[tid] = src[tid];
}
The above function works with an array of objects of any type, provided their size and alignment are multiples of 4 bytes (the size of a 32-bit int).
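As a usage sketch (the struct and kernel are made-up examples, assuming memCopy is defined as above in the same file), a block can cooperatively stage a slice of global data into dynamic shared memory and then synchronize before using it:

struct Particle {            // 16 bytes, 4-byte aligned, so memCopy applies
    float x, y, z, mass;
};

__global__ void processParticles(Particle *gParticles, int count)
{
    // Launch with count * sizeof(Particle) bytes of dynamic shared memory.
    extern __shared__ Particle sParticles[];

    // All threads of the block cooperate in the copy.
    memCopy(sParticles, gParticles + blockIdx.x * count, count);
    __syncthreads();         // make the staged data visible to every thread

    // ... work on sParticles ...
}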
Copying an array of float4 directly will lead to 4-way bank conflicts.
However, if I am not mistaken, loading 128 bits per thread from global memory is faster than loading 32 bits. Given those two observations about global and shared memory access patterns, I wonder: what is the fastest method to copy unstructured data from global to shared memory?
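One possible way to combine the two observations, purely as a sketch and not a benchmark-backed answer (names and layout are illustrative assumptions): read 128 bits per thread from global memory, then write the four components as separate 32-bit stores into a structure-of-arrays shared layout, so consecutive threads hit consecutive banks:

__global__ void stageFloat4(const float4 *gIn, float *gOut)
{
    // Launch with 4 * blockDim.x * sizeof(float) bytes of dynamic shared memory.
    extern __shared__ float s[];

    int tid = threadIdx.x;
    float4 v = gIn[blockIdx.x * blockDim.x + tid];  // one 128-bit global load

    // Structure-of-arrays: component c of thread tid lands at c*blockDim.x + tid,
    // so each 32-bit store is stride-1 across the warp.
    s[0 * blockDim.x + tid] = v.x;
    s[1 * blockDim.x + tid] = v.y;
    s[2 * blockDim.x + tid] = v.z;
    s[3 * blockDim.x + tid] = v.w;
    __syncthreads();

    gOut[blockIdx.x * blockDim.x + tid] = s[tid];   // placeholder use of the data
}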