Why does this kernel get better performance?

Hey guys,

Regarding the matrix transpose kernel: I implemented my own design, which turned out to be the same as the "naive" design in the SDK. Then I took a look at the optimized design. I understand what it does, but I don't see where its performance comes from. If you look at the code, it does almost the same thing, with the small difference that it stores the values in a temporary (shared-memory) tile and afterwards copies them to the destination, so we have two cycles of memory read/write. But the naive design also reads A(x,y) [memory read]

and copies it to B(y,x) [memory write], so the read/write cycles seem equivalent. Yet if you run the two kernels, design B is much faster than the naive one.

Can you tell me what the difference is?

Design A: naive design for matrix transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}



Design B: Optimized kernel for matrix transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    // the +1 padding avoids shared-memory bank conflicts on the transposed read below
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    if ((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    // make sure the whole tile is loaded before any thread reads it back
    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    if ((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}



You should take a look at the AstroGPU docs: http://www.astrogpu.org/videos.php

In the "Nvidia: CUDA tutorial" section there are two optimization PDFs where the transpose kernel is covered in detail. If the PDFs aren't enough, check the videos.

To sum up: in the optimized kernel all accesses are coalesced, which is not the case for the naive one. Moreover, shared-memory bank conflicts are avoided with the "+1" padding. It's well explained in the PDF.


I'm still having problems with that. Can you explain more about coalesced memory access and about bank conflicts?
I removed the "+1" from code B (which I think is related to bank conflicts), but the kernel still runs successfully. Why?

It is all spelled out very well in the CUDA Programming Guide. If you have a specific question about a part of the guide's description of coalescing or bank conflicts that you don't understand, we can answer it here, but explaining coalescing in full isn't really possible on the forums.

And do read the whitepaper PDF that accompanies the SDK example: it is very well written and fully explains the reasoning behind every line of code.