If I declare two device arrays and copy them into shared memory like this:
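// N: array length, not specified here; assumed to be at least the number of threads
__device__ int array1[N];
__device__ int array2[N];
__shared__ int sharray1[N];
__shared__ int sharray2[N];
int tid = blockIdx.x*blockDim.x + threadIdx.x;
sharray1[tid] = array1[tid];
sharray2[tid] = array2[tid];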
then, assuming a block size of 32, what exactly happens when the arrays are read from global memory into shared memory: how are the global memory accesses coalesced, and how does the data fill shared memory if shared memory has 16 banks?
Also, is it always more efficient to work in half-warps when accessing data in shared memory (because of the 16 banks), and in full warps when coalescing data to and from global memory?
Finally, for an array of size 16 in shared memory, is there any difference between the following two ways of having every thread read the whole array? In the first, all threads read the same element at each loop iteration: element 0, then element 1, then element 2, …, and finally element 15. In the second, the reads are staggered: each thread starts at the element matching its thread ID and wraps around, so that thread 3, for example, reads element 3, then element 4, …, element 15, then element 0, element 1, and finally element 2.
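To make the two read orders in the last question concrete, here is a minimal sketch of what I mean. The names `uniformIndex`, `staggeredIndex`, `bankOf` and `sumShared` are made up for this illustration, and `bankOf` assumes that consecutive 4-byte words map to consecutive banks, which is my understanding of the 16-bank layout rather than something I have verified. The `#ifndef __CUDACC__` fallback is only there so the index helpers can also be compiled as plain C++.

```cpp
#include <cassert>
#include <cstdio>

#ifndef __CUDACC__   // let the index helpers also build as plain C++
#define __host__
#define __device__
#endif

#define ARRAY_SIZE 16  // shared array length from the question
#define NUM_BANKS  16  // bank count from the question (4-byte-wide banks assumed)

// Pattern A: at loop step `step`, every thread reads the same element.
__host__ __device__ inline int uniformIndex(int tid, int step) {
    (void)tid;  // the thread ID does not affect the index
    return step;
}

// Pattern B: thread `tid` starts at element `tid` and wraps around.
__host__ __device__ inline int staggeredIndex(int tid, int step) {
    return (tid + step) % ARRAY_SIZE;
}

// Bank holding a given int element, assuming consecutive 4-byte words
// map to consecutive banks.
__host__ __device__ inline int bankOf(int element) {
    return element % NUM_BANKS;
}

#ifdef __CUDACC__
// Each thread sums all 16 elements, visiting them in the chosen order.
__global__ void sumShared(const int *in, int *out, bool staggered) {
    __shared__ int sh[ARRAY_SIZE];
    if (threadIdx.x < ARRAY_SIZE) sh[threadIdx.x] = in[threadIdx.x];
    __syncthreads();  // make the shared copy visible to the whole block

    int sum = 0;
    for (int step = 0; step < ARRAY_SIZE; ++step) {
        int idx = staggered ? staggeredIndex(threadIdx.x, step)
                            : uniformIndex(threadIdx.x, step);
        sum += sh[idx];
    }
    out[threadIdx.x] = sum;
}

int main() {
    const int threads = 32;  // block size from the question
    int hIn[ARRAY_SIZE], hOut[threads];
    for (int i = 0; i < ARRAY_SIZE; ++i) hIn[i] = i;

    int *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    for (int staggered = 0; staggered < 2; ++staggered) {
        sumShared<<<1, threads>>>(dIn, dOut, staggered != 0);
        cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
        // Both orders visit every element once, so each sum is 0+1+...+15.
        for (int t = 0; t < threads; ++t) assert(hOut[t] == 120);
    }
    printf("both read orders give the same sums\n");

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
#endif
```

Under that bank assumption, at any given step the first pattern has all threads of a half-warp reading the same word in the same bank, while the second has the 16 threads of a half-warp reading 16 different elements in 16 different banks. Both versions compute the same result, so my question is only about whether one order is faster.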