Why am I getting better performance with per column vs per row for matrix addition?

I’m beginning to learn CUDA, but I’ve stumbled across an issue that has perplexed me. I’ve written two kernels that add two matrices: one assigns one row per thread, the other one column per thread.

I assumed that the per-row version would be faster, since CUDA arrays are organized in a row-major layout. However, after profiling with NVIDIA Nsight I found that the column version was consistently faster for large data sets, by about 2x-4x depending on block size. I’m not sure why this is the case, as memory locality should favor the row version: the column version’s access pattern is more spread out per thread.

Anyone have an explanation for the results that I’m seeing?

I’ve also attached a single element per thread addition which is consistently faster than both the row and column implementations.

/** Each thread adds a single element of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionSingleElementKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 row = blockIdx.x * blockDim.x + threadIdx.x;
	u32 col = blockIdx.y * blockDim.y + threadIdx.y;

	if (row < dimensions && col < dimensions)
	{
		u32 uniqueIdx = row * dimensions + col;
		outMatrix[uniqueIdx] = matrixA[uniqueIdx] + matrixB[uniqueIdx];
	}
}

/** Each thread adds a row of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionRowKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 row = blockIdx.x * blockDim.x + threadIdx.x;

	if (row < dimensions)
	{
		u32 rowStartIdx = row * dimensions; 

		for (u32 colIdx = 0; colIdx < dimensions; ++colIdx)
		{
			const u32 curIdx = rowStartIdx + colIdx;
			outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
		}
	}
}

/** Each thread adds a column of the matrices
  * only supports 2D square matrices
  */
__global__
void MatrixAdditionColumnKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
	u32 col = blockIdx.x * blockDim.x + threadIdx.x;

	if (col < dimensions)
	{
		for (u32 rowIdx = 0; rowIdx < dimensions; ++rowIdx)
		{
			const u32 curIdx = rowIdx * dimensions + col;
			outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
		}
	}
}

This has to do with memory coalescing in CUDA, i.e. efficient use of the memory subsystem.

When each thread is reading a column of data, adjacent threads in a warp, at each memory read instruction, are loading adjacent elements from memory. This is the optimal usage of the memory subsystem: the warp’s 32 loads coalesce into the minimum number of memory transactions.

When each thread is reading a row of data, adjacent threads in a warp are requesting data that is separated by the row width (`dimensions` elements). Each memory transaction then delivers mostly unused bytes, so many more transactions are needed per warp. This is far less efficient.

This presentation may be of interest:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

To understand coalescing, it’s necessary to think about what adjacent threads in a warp are doing, instruction by instruction.