I have a question regarding the SDK example project “transpose”.
In the file “transpose_kernel.cu” (lines 45 to 50) it says that the amount of shared memory allocated is BLOCK_DIM * (BLOCK_DIM + 1), and that because of this there will be no bank conflicts.
I don't understand that. I only need BLOCK_DIM * BLOCK_DIM to have enough shared memory to store the data, so why is there one more row in the shared memory?
Maybe someone could explain or give me a little hint!
Greetings from Germany,
For clarity, consider the block as a one dimensional array that’s addressed like this:
block[row*BLOCK_DIM + col]
With each row having a multiple of 16 elements, there will be bank conflicts when all the threads read or write to a single column, for example when reading column zero:
block[threadIdx.x*BLOCK_DIM + 0]
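In this case the threads are attempting to access block[0], block[BLOCK_DIM], block[2*BLOCK_DIM], … Since BLOCK_DIM is a multiple of 16 and consecutive 32-bit words are assigned to consecutive banks (16 banks here), all these locations are in the same bank, which means the accesses are forced to occur serially.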
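To make the arithmetic concrete, here is a small host-side C sketch (not code from the SDK) that counts how many threads land in the same bank for this access pattern. It assumes 16 banks of 32-bit words and 16 threads accessing shared memory together; the names NUM_BANKS, HALF_WARP, bank_of and max_conflict_degree are made up for illustration.

```c
#define NUM_BANKS 16   /* assumption: 16 banks, one 32-bit word wide each */
#define HALF_WARP 16   /* assumption: 16 threads access shared memory together */

/* Bank that holds the 32-bit word at the given index of a shared array. */
static int bank_of(int word_index)
{
    return word_index % NUM_BANKS;
}

/* Largest number of threads that hit the same bank when thread t
   accesses block[t*row_stride + col]. A result of 1 means no conflict. */
static int max_conflict_degree(int row_stride, int col)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < HALF_WARP; ++t) {
        int b = bank_of(t * row_stride + col);
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}
```

Under these assumptions, max_conflict_degree(16, 0) evaluates to 16: every thread maps to bank 0, so the column access is serialized into 16 steps.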
The fix is to pad each row to BLOCK_DIM+1 elements, i.e.
block[row*(BLOCK_DIM+1) + col]
This takes a little more space, but then the accesses to a single column
block[threadIdx.x*(BLOCK_DIM+1) + 0]
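become block[0], block[BLOCK_DIM+1], block[2*(BLOCK_DIM+1)], … (for BLOCK_DIM = 16 that is words 0, 17, 34, …, i.e. banks 0, 1, 2, …), which are all in different banks and therefore occur simultaneously.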
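As a rough check of this claim, the following host-side C sketch (again not SDK code) tests whether a group of 16 threads hits 16 different banks, both for column accesses and for ordinary row accesses. It assumes BLOCK_DIM = 16, 16 banks of 32-bit words, and 16 threads accessing together; the function names are hypothetical.

```c
#define BLOCK_DIM 16   /* assumption: tile size of 16 */
#define NUM_BANKS 16   /* assumption: 16 banks, one 32-bit word wide each */
#define HALF_WARP 16   /* assumption: 16 threads access shared memory together */

/* Returns 1 if all threads hit different banks when thread t
   accesses block[t*row_stride + col] (a column access), else 0. */
static int column_access_conflict_free(int row_stride, int col)
{
    int used[NUM_BANKS] = {0};
    for (int t = 0; t < HALF_WARP; ++t) {
        int bank = (t * row_stride + col) % NUM_BANKS;
        if (used[bank])
            return 0;
        used[bank] = 1;
    }
    return 1;
}

/* Same check for a row access, block[row*row_stride + t]. */
static int row_access_conflict_free(int row_stride, int row)
{
    int used[NUM_BANKS] = {0};
    for (int t = 0; t < HALF_WARP; ++t) {
        int bank = (row * row_stride + t) % NUM_BANKS;
        if (used[bank])
            return 0;
        used[bank] = 1;
    }
    return 1;
}
```

With row_stride = BLOCK_DIM + 1, both checks return 1 for every row and column of the tile, so the padding removes the column conflicts without introducing conflicts for row accesses. With row_stride = BLOCK_DIM, the column check returns 0.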