example project "transpose"

MatzeGTO · March 13, 2009, 8:15am

Hi,

I have a question regarding the SDK example project “transpose”.

In the file “transpose_kernel.cu” (lines 45 to 50), the comment says that the amount of shared memory allocated is BLOCK_DIM * (BLOCK_DIM + 1), and that this avoids bank conflicts.

I don't understand that… I only need BLOCK_DIM * BLOCK_DIM elements of shared memory to store the data, so why is there one more row in the shared memory?

Maybe someone could explain or give me a little hint!

Greetings from Germany,
MatzeGTO

Jamie_K · March 13, 2009, 2:28pm

For clarity, consider the block as a one dimensional array that’s addressed like this:

block[row*BLOCK_DIM + col]

With each row having a multiple of 16 elements, there will be bank conflicts when all the threads read or write to a single column, for example when reading column zero:

block[threadIdx.x*BLOCK_DIM + 0]

In this case the threads are attempting to access block[0], block[16], block[32], … and all these locations are in the same bank, which means the accesses are forced to occur serially.

Now pad each row to BLOCK_DIM+1 elements, i.e.

block[row*(BLOCK_DIM+1) + col]

This takes a little bit more space, but then the accesses to a single column

block[threadIdx.x*(BLOCK_DIM+1) + 0]

become block[0], block[17], block[34], … which are all in different banks and therefore occur simultaneously.
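
To check the arithmetic above, here is a small host-side sketch (plain C++, no GPU needed) that counts how many distinct banks a column access touches. It assumes 16 banks that are each one 32-bit word wide and BLOCK_DIM = 16, which is what the "multiple of 16" reasoning implies; `bank_of` and `distinct_banks` are hypothetical helper names, not part of the SDK sample.

```cpp
#include <cassert>
#include <set>

// Assumption: 16 shared-memory banks, consecutive 32-bit words map to
// consecutive banks, and 16 threads access shared memory together.
constexpr int NUM_BANKS = 16;
constexpr int BLOCK_DIM = 16;

// Bank that holds element `index` of a shared array of 32-bit values.
int bank_of(int index) { return index % NUM_BANKS; }

// Number of distinct banks touched when BLOCK_DIM threads each access
// block[t * row_stride + col], i.e. one whole column of the tile.
int distinct_banks(int row_stride, int col) {
    assert(row_stride > 0 && col >= 0 && col < row_stride);
    std::set<int> banks;
    for (int t = 0; t < BLOCK_DIM; ++t)
        banks.insert(bank_of(t * row_stride + col));
    return static_cast<int>(banks.size());
}
```

Under these assumptions, distinct_banks(BLOCK_DIM, 0) gives 1 (every thread lands in the same bank, so the accesses are serialized), while distinct_banks(BLOCK_DIM + 1, 0) gives 16 (one bank per thread, so there is no conflict). The same holds for any other column, because a row stride of 17 shifts each successive row by exactly one bank.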
