I’m beginning to learn CUDA, but I’ve stumbled across an issue that has perplexed me. I’ve written two kernels that add one matrix to another: one that processes one row per thread, and another that processes one column per thread.

I assumed that the per-row version would be faster, since CUDA arrays use a row-major layout. However, after profiling with NVIDIA Nsight I found that the column version was consistently faster on large data sets, by about 2x-4x depending on block size. I’m not sure why this is the case: memory locality should favor the row version, since in the column version each thread touches elements that are spread far apart in memory.

Anyone have an explanation for the results that I’m seeing?

I’ve also attached a single-element-per-thread version, which is consistently faster than both the row and column implementations.

```
/** Each thread adds a single element of the matrices.
 * Only supports 2D square matrices.
 */
__global__
void MatrixAdditionSingleElementKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
    u32 row = blockIdx.x * blockDim.x + threadIdx.x;
    u32 col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < dimensions && col < dimensions)
    {
        u32 uniqueIdx = row * dimensions + col;
        outMatrix[uniqueIdx] = matrixA[uniqueIdx] + matrixB[uniqueIdx];
    }
}

/** Each thread adds one row of the matrices.
 * Only supports 2D square matrices.
 */
__global__
void MatrixAdditionRowKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
    u32 row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < dimensions)
    {
        u32 rowStartIdx = row * dimensions;
        for (u32 colIdx = 0; colIdx < dimensions; ++colIdx)
        {
            const u32 curIdx = rowStartIdx + colIdx;
            outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
        }
    }
}

/** Each thread adds one column of the matrices.
 * Only supports 2D square matrices.
 */
__global__
void MatrixAdditionColumnKernel(f32* outMatrix, const f32* matrixA, const f32* matrixB, const u32 dimensions)
{
    u32 col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < dimensions)
    {
        for (u32 rowIdx = 0; rowIdx < dimensions; ++rowIdx)
        {
            const u32 curIdx = rowIdx * dimensions + col;
            outMatrix[curIdx] = matrixA[curIdx] + matrixB[curIdx];
        }
    }
}
```