example project "transpose"


I have a question regarding the SDK example project “transpose”.

In the file “transpose_kernel.cu” (lines 45 to 50) it says that the amount of shared memory allocated is BLOCK_DIM * (BLOCK_DIM + 1), and that because of this there will be no bank conflicts.

I don't understand that. BLOCK_DIM * BLOCK_DIM is enough shared memory to store the data, so why is there one more row in the shared memory?

Maybe someone could explain or give me a little hint!

Greetings from Germany,

For clarity, consider the block as a one dimensional array that’s addressed like this:

block[row*BLOCK_DIM + col]

With each row having a multiple of 16 elements, there will be bank conflicts when all the threads read or write to a single column, for example when reading column zero:

block[threadIdx.x*BLOCK_DIM + 0]

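In this case the threads attempt to access block[0], block[16], block[32], … Assuming 16 banks with consecutive 32-bit words assigned to consecutive banks, every one of these indices is a multiple of 16 and so falls in the same bank, which forces the accesses to occur serially.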

Now pad each row to BLOCK_DIM+1 elements, i.e.

block[row*(BLOCK_DIM+1) + col]

This takes a little more space, but then the accesses to a single column

block[threadIdx.x*(BLOCK_DIM+1) + 0]

become block[0], block[17], block[34], … which fall in banks 0, 1, 2, … (the index modulo 16), so they are all in different banks and can occur simultaneously.
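
To check the arithmetic, here is a small host-side sketch in plain C++ (not the SDK's code) that models the bank mapping. It assumes 16 banks, consecutive 32-bit words in consecutive banks, BLOCK_DIM equal to 16, and 16 threads that each read column zero of their own row. The names NUM_BANKS, bank_of and worst_case_conflict are made up for this illustration.

```cpp
// Assumption: 16 shared-memory banks, with consecutive 32-bit words
// assigned to consecutive banks (so bank = word index modulo 16).
constexpr int NUM_BANKS = 16;
constexpr int BLOCK_DIM = 16;  // assumed tile width for this illustration

// Bank that holds the 32-bit word at a given flat index.
constexpr int bank_of(int index) { return index % NUM_BANKS; }

// 16 threads each read column 0 of their own row, i.e. thread t reads
// block[t * row_stride + 0]. Returns the largest number of threads that
// land in the same bank: 1 means conflict-free, 16 means fully serialized.
constexpr int worst_case_conflict(int row_stride) {
    int hits[NUM_BANKS] = {};
    int worst = 0;
    for (int t = 0; t < NUM_BANKS; ++t) {
        int b = bank_of(t * row_stride);
        ++hits[b];
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// Unpadded rows: every thread hits bank 0.
static_assert(worst_case_conflict(BLOCK_DIM) == 16, "16-way conflict expected");
// Rows padded by one element: every thread hits a different bank.
static_assert(worst_case_conflict(BLOCK_DIM + 1) == 1, "no conflict expected");
```

The two static_assert lines are checked at compile time, so the file compiles only if the unpadded layout gives a 16-way conflict and the padded layout gives none under the assumptions above.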