Bank Conflicts


I would like to know the size of each shared memory bank and how many banks there are.
In particular, I want to store nThreads*sizeof(float4) bytes in shared memory.
Each thread would access one float4 value: thread 0 accesses the 0th float4, thread 1 accesses the 1st float4, and so on.
I read through the Programming Guide but could not figure out how to reduce bank conflicts. Is padding a way to do it?
And do bank conflicts occur when threads in a half-warp access the same bank, or when all 32 threads in a warp access the same bank?
Please help.

Shared memory is 16 KB and has 16 banks, so each bank holds 1 KB. The banks are interleaved: successive 32-bit words belong to successive banks.
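
To make the bank layout concrete, here is a minimal host-side C++ sketch of the address-to-bank mapping. It assumes 16 banks with successive 32-bit words assigned to successive banks. The name bank_of and the two constants are made up for illustration; they are not part of the CUDA API.

```cpp
#include <cstddef>

// Assumed layout (illustrative constants, not CUDA API names):
// 16 banks, each serving one 32-bit word per access.
constexpr std::size_t kNumBanks  = 16;
constexpr std::size_t kBankWidth = 4;   // bytes per bank word

// Bank that holds the shared-memory byte at the given offset.
constexpr std::size_t bank_of(std::size_t byte_offset)
{
    return (byte_offset / kBankWidth) % kNumBanks;
}
```

With this mapping, float element k of a shared array sits in bank k % 16, while float4 element k starts in bank (4*k) % 16, so float4 elements 0, 4, 8 and 12 all start in bank 0.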

Suppose a thread block is one-dimensional and has 256 threads.

Then each thread can copy four floats (not one float4) to avoid bank conflicts:

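__global__ void foo( float4 *A )
{
	__shared__ float4  A_sh[ 256 ];

	int th_id = threadIdx.x;

	// View both arrays as plain floats; thread k starts at float k.
	float *A_ptr = (float*) A + th_id;
	float *A_sh_ptr = (float*) A_sh + th_id;

	// Load global -> shared in 4 passes of 256 floats. In each pass,
	// consecutive threads touch consecutive 32-bit words, so the threads
	// of a half-warp all hit different banks.
	for( int i = 0; i < 4; i++ ){
		A_sh_ptr[0] = A_ptr[0];
		A_ptr += 256;
		A_sh_ptr += 256;
	}

	// Make the loaded data visible to the whole block before it is used.
	__syncthreads();
}
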

Bank conflicts are determined per half-warp, not per warp.

When copying data from/to/between shared memory I use the following simple function to avoid bank conflicts:

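template <typename T>
__device__ void memCopy(T *destination, T *source, int size) {
	// Treat both arrays as 32-bit words so that consecutive threads
	// access consecutive banks.
	int *dest=(int *)destination;
	int *src=(int *)source;
	for (int tid=threadIdx.x;tid<size*sizeof(T)/4;tid+=blockDim.x)
		dest[tid]=src[tid];
}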

The above function works with an array of any kind of object, provided the object's size and alignment are multiples of 4 bytes (the size of a 32-bit int).

Copying an array of float4 directly will lead to 4-way bank conflicts.
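
The 4-way figure can be checked with a small host-side model. The sketch below assumes 16 banks of 32-bit words and that the 16 threads of a half-warp are serviced together, with thread t reading the word at index t * stride. conflict_degree is a hypothetical helper written for this post, not a CUDA function, and it needs C++14 for the constexpr loop.

```cpp
// Host-side model (plain C++14), not device code.
// Assumptions: 16 banks of 32-bit words, 16 threads per half-warp.
constexpr int kBanks    = 16;
constexpr int kHalfWarp = 16;

// Largest number of threads in one half-warp that hit the same bank
// when thread t accesses the 32-bit word at index t * stride_in_words.
// 1 means conflict-free; n means an n-way bank conflict.
constexpr int conflict_degree(int stride_in_words)
{
    int hits[kBanks] = {};
    int worst = 0;
    for (int t = 0; t < kHalfWarp; ++t) {
        int bank = (t * stride_in_words) % kBanks;
        ++hits[bank];
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

Under these assumptions conflict_degree(1) is 1 (one 32-bit word per thread, as in memCopy), while conflict_degree(4) is 4 (the first word of one float4 per thread), which matches the 4-way conflict mentioned above.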

However, if I am not mistaken, loading 128 bits per thread from global memory is faster than loading 32 bits per thread. Given these two observations about global and shared memory access patterns, I wonder: what is the fastest way to copy unstructured data from global to shared memory?