Why am I getting better performance with per column vs per row for matrix addition?

I’m beginning to learn CUDA, but I’ve stumbled across an issue that has perplexed me. I’ve written two kernels that add two matrices: one assigns one row per thread, the other one column per thread.

I assumed that the per-row version would be faster, since CUDA arrays are organized in a row-major layout. However, after profiling with NVIDIA Nsight I found that the column version was consistently faster for large data sets, by about 2x-4x depending on block size. I’m not sure why this is the case, as memory locality should favor the row version: the column version’s access pattern is more spread out per thread.

Anyone have an explanation for the results that I’m seeing?

I’ve also attached a single element per thread addition which is consistently faster than both the row and column implementations.

/** Each thread adds a single element of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionSingleElementKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 row = blockIdx.x * blockDim.x + threadIdx.x;
	u32 col = blockIdx.y * blockDim.y + threadIdx.y;

	if (row < dimensions && col < dimensions)
	{
		u32 uniqueIdx = row * dimensions + col;
		outMatrix[uniqueIdx] = matrixA[uniqueIdx] + matrixB[uniqueIdx];
	}
}

/** Each thread adds a row of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionRowKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 row = blockIdx.x * blockDim.x + threadIdx.x;

	if (row < dimensions)
	{
		u32 rowStartIdx = row * dimensions; 

		for (u32 colIdx = 0; colIdx < dimensions; ++colIdx)
		{
			const u32 curIdx = rowStartIdx + colIdx;
			outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
		}
	}
}

/** Each thread adds a column of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionColumnKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 col = blockIdx.x * blockDim.x + threadIdx.x;

	if (col < dimensions)
	{
		for (u32 rowIdx = 0; rowIdx < dimensions; ++rowIdx)
		{
			const u32 curIdx = rowIdx * dimensions + col;
			outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
		}
	}
}

This has to do with memory coalescing in CUDA, i.e. efficient use of the memory subsystem.

When each thread is reading a column of data, adjacent threads in a warp, at each memory read instruction, are loading adjacent elements from memory. This is the optimal usage of the memory subsystem: the warp’s 32 loads coalesce into the minimum number of memory transactions.

When each thread is reading a row of data, adjacent threads in a warp are requesting data that is separated by the row width (`dimensions` elements). Each memory transaction then delivers mostly unused bytes, so many more transactions are needed per warp. This is far less efficient.

This presentation may be of interest:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

To understand coalescing, it’s necessary to think about what adjacent threads in a warp are doing, instruction by instruction.