I would like to know the size of each bank and how many banks there are.
In particular, I want to store nThreads*sizeof(float4) bytes in shared memory.
Each thread would access one float4 value: thread 0 accesses the 0th float4, thread 1 accesses the 1st float4, and so on.
I read through the Programming Guide but could not figure out how to reduce bank conflicts. Is padding a way to do it?
And do bank conflicts occur when threads in a half warp access the same bank, or only when all 32 threads in a warp access the same bank?
Please help.
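To make the access pattern concrete, here is a minimal sketch of what I have in mind (the kernel and variable names are just illustrative, not real code I am using), with the dynamic shared array sized nThreads*sizeof(float4) bytes at launch:

#include <cuda_runtime.h>

__global__ void useSharedFloat4(const float4 *in, float4 *out)
{
    extern __shared__ float4 tile[];   // nThreads float4 values

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];               // stage one float4 per thread
    __syncthreads();

    out[gid] = tile[tid];              // thread i reads the i-th float4
}

// Launched, for example, as:
//   useSharedFloat4<<<gridSize, nThreads, nThreads * sizeof(float4)>>>(in, out);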
When copying data to, from, or between shared memory buffers, I use the following simple function to avoid bank conflicts:
template <typename T>
__device__ void memCopy(T *destination, const T *source, int size) {
    // Reinterpret both arrays as 32-bit words so that consecutive threads
    // access consecutive words, i.e. consecutive banks.
    int *dest = (int *)destination;
    const int *src = (const int *)source;
    // size is the number of T elements; each thread strides by blockDim.x words.
    for (int tid = threadIdx.x; tid < size * sizeof(T) / 4; tid += blockDim.x)
        dest[tid] = src[tid];
}
The above function works with an array of objects of any type, provided their size and alignment are multiples of 4 bytes (the size of a 32-bit int).
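As a usage sketch (the struct and kernel are made-up examples, assuming memCopy is defined as above in the same file), a block can cooperatively stage a slice of global data into dynamic shared memory and then synchronize before using it:

struct Particle {            // 16 bytes, 4-byte aligned, so memCopy applies
    float x, y, z, mass;
};

__global__ void processParticles(Particle *gParticles, int count)
{
    // Launch with count * sizeof(Particle) bytes of dynamic shared memory.
    extern __shared__ Particle sParticles[];

    // All threads of the block cooperate in the copy.
    memCopy(sParticles, gParticles + blockIdx.x * count, count);
    __syncthreads();         // make the staged data visible to every thread

    // ... work on sParticles ...
}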
Copying an array of float4 directly will lead to 4-way bank conflicts.
However, if I am not mistaken, loading 128 bits per thread from global memory is faster than loading 32 bits. Given those two observations about global and shared memory access patterns, I wonder: what is the fastest method to copy unstructured data from global to shared memory?
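One possible way to combine the two observations, purely as a sketch and not a benchmark-backed answer (names and layout are illustrative assumptions): read 128 bits per thread from global memory, then write the four components as separate 32-bit stores into a structure-of-arrays shared layout, so consecutive threads hit consecutive banks:

__global__ void stageFloat4(const float4 *gIn, float *gOut)
{
    // Launch with 4 * blockDim.x * sizeof(float) bytes of dynamic shared memory.
    extern __shared__ float s[];

    int tid = threadIdx.x;
    float4 v = gIn[blockIdx.x * blockDim.x + tid];  // one 128-bit global load

    // Structure-of-arrays: component c of thread tid lands at c*blockDim.x + tid,
    // so each 32-bit store is stride-1 across the warp.
    s[0 * blockDim.x + tid] = v.x;
    s[1 * blockDim.x + tid] = v.y;
    s[2 * blockDim.x + tid] = v.z;
    s[3 * blockDim.x + tid] = v.w;
    __syncthreads();

    gOut[blockIdx.x * blockDim.x + tid] = s[tid];   // placeholder use of the data
}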