Question about the example "Shared Memory in Matrix Multiplication (C = AA^T)" i

The section "Shared Memory in Matrix Multiplication (C = AA^T)" in NVIDIA_CUDA_BestPracticesGuide_2.3.pdf shows the multiplication of a matrix by its transpose. The optimized code is as follows:

__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Load one tile of a, and one tile of a stored transposed.
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM + threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM + threadIdx.x];
    // Wait until every thread in the block has stored its elements.
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row*M + col] = sum;
}

The document says that bank conflicts occur when transposedTile is written, but I don't understand why. Here TILE_DIM equals 16, so transposedTile is a 16x16 matrix.
1) When the code executes, each of the 16 threads in a half-warp writes one element of the same row of transposedTile (right?). Each element is 4 bytes (32 bits), exactly the width of one bank. If I am right, each of the 16 threads in a half-warp writes to a separate bank, so where do the bank conflicts come from?
2) If bank conflicts really do occur as the document says, why does declaring transposedTile as transposedTile[TILE_DIM][TILE_DIM+1] eliminate them entirely?
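
To make both questions concrete, here is a minimal host-side sketch in plain C of how the bank mapping could be modelled. It is not taken from the guide. It assumes 16 banks, that successive 32-bit words fall into successive banks (bank = word offset mod 16), and that within a half-warp threadIdx.x runs from 0 to 15 while threadIdx.y stays fixed. The function name distinct_banks and its parameters are my own, made up for illustration.

```c
#define TILE_DIM   16   /* tile size used in the question */
#define NUM_BANKS  16   /* assumption: 16 banks, each one 32-bit word wide */
#define HALF_WARP  16   /* threads that access shared memory together */

/* Count how many distinct banks one half-warp touches when thread tx
 * writes tile[tx][ty], where each row of the tile holds `width` floats.
 * tx plays the role of threadIdx.x, ty the role of threadIdx.y. */
static int distinct_banks(int width, int ty)
{
    int hit[NUM_BANKS] = {0};
    int count = 0;
    for (int tx = 0; tx < HALF_WARP; tx++) {
        int word = tx * width + ty;   /* offset of tile[tx][ty] in 32-bit words */
        int bank = word % NUM_BANKS;  /* assumed word-to-bank mapping */
        if (!hit[bank]) {
            hit[bank] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, distinct_banks(TILE_DIM, 0) returns 1, meaning all 16 writes land in the same bank, while distinct_banks(TILE_DIM + 1, 0) returns 16, one bank per thread. That would agree with the document if the write transposedTile[threadIdx.x][threadIdx.y] really goes down a column of the tile rather than along a row as I assumed in 1), but I am not sure this model matches what the hardware actually does.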

If anybody knows the principle well, any help would be appreciated.