Why do I get better performance with this kernel?

Admirer4 · July 10, 2008, 6:27am

hey guys,

Regarding the matrix transpose kernel: I implemented my own design, which turned out to be the same as the “naive” design in the SDK. After that I looked at the optimized design. I understand what it does, but I don’t know where its performance comes from. As you can see in the code, it does almost the same thing, with one small difference: it first saves the values in a temporary (shared) matrix and then copies them to the destination, so there are 2 memory read/write cycles. The naive design also reads A(x,y) [memory read] and copies it to B(y,x) [memory write], so the number of read/write cycles looks the same to me. But if you run the two kernels, you will see that design B is much faster than the naive one.

Can you tell me what the difference is?

Design A: naive design for matrix transpose

__global__ void transpose_naive(float *odata, float* idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xIndex < width && yIndex < height)
    {
        unsigned int index_in  = xIndex + width * yIndex;
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

Design B: Optimized kernel for matrix transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[BLOCK_DIM][BLOCK_DIM+1];

    // read the matrix tile into shared memory
    unsigned int xIndex = blockIdx.x * BLOCK_DIM + threadIdx.x;
    unsigned int yIndex = blockIdx.y * BLOCK_DIM + threadIdx.y;
    //if((xIndex < width) && (yIndex < height))
    {
        unsigned int index_in = yIndex * width + xIndex;
        block[threadIdx.y][threadIdx.x] = idata[index_in];
    }

    __syncthreads();

    // write the transposed matrix tile to global memory
    xIndex = blockIdx.y * BLOCK_DIM + threadIdx.x;
    yIndex = blockIdx.x * BLOCK_DIM + threadIdx.y;
    //if((xIndex < height) && (yIndex < width))
    {
        unsigned int index_out = yIndex * height + xIndex;
        odata[index_out] = block[threadIdx.x][threadIdx.y];
    }
}

oreo1 · July 10, 2008, 7:29am

You should take a look here:

AstroGPU docs: http://www.astrogpu.org/videos.php

Under “Nvidia: CUDA tutorial” there are 2 optimization PDFs that cover the transpose kernel in detail. If the PDFs aren’t enough, check the videos.

To sum up: in the optimized kernel, all global memory accesses are coalesced (that’s not the case for the naive one). Moreover, shared memory bank conflicts are avoided by the “+1”. It’s well explained in the PDFs.

oYo.

Admirer4 · July 13, 2008, 7:33am

I am still having a problem with that. Can you explain more about coalesced memory access, and about bank conflicts?
I removed the “+1” from the code of design B (which I think is related to the bank conflicts), but it still runs successfully. Why?

MisterAnderson42 · July 13, 2008, 11:57am

It is all spelled out very well in the CUDA programming guide. If you have a specific question related to a part of the guide’s description of coalescing/bank conflicts that you don’t understand we can answer it here, but explaining coalescing in full isn’t really possible on the forums.

And do read the whitepaper pdf that goes with the SDK example: it is very well written and fully explains the reasoning behind every line of code.
