Summing the rows and columns of a 2D array

I have been working on a problem which requires summing the rows followed by summing the columns of a 2D array (matrix). I noticed the column summation was faster than the row summation, which to me goes against what I learned about memory access coalescing.
I extracted the summations to test this in isolation and found that column summations were about twice as fast as the row summations. I would have expected this to be the other way around since the row summations read contiguous data. I am obviously missing something, or being completely blind, and was wondering if anyone can see the reason for this. Below are the kernels I tested.

The first sums the value 1.0 across the rows:

static __global__ void dRowSum(float* d_row_sum) {

	// get indices
	unsigned int s_tx = threadIdx.x;

	unsigned int s_ty = threadIdx.y;
	unsigned int d_ty = BLOCKSIZE * blockIdx.y + threadIdx.y;

	// Create shared mem CP
	__shared__ float s_C[BLOCKSIZE][BLOCKSIZE];

	// @@ ... ms
	s_C[s_ty][s_tx] = 1.0;
	__syncthreads();

	// Sum reduce each column of C
	// @@ ... ms
	for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
		__syncthreads();
		if (s_tx < stride) {
			s_C[s_ty][s_tx] += s_C[s_ty][s_tx + stride];
		}
	}
	__syncthreads();

	// Save col 0 to global mem
	// @@ ... ms
	if (s_tx == 0) {
		d_row_sum[d_ty] = s_C[s_ty][0];
	}

}

The second sums the value 1.0 across the columns:

static __global__ void dColSum(float* d_col_sum) {

	// get indices
	unsigned int s_tx = threadIdx.x;
	unsigned int d_tx = BLOCKSIZE * blockIdx.x + threadIdx.x;

	unsigned int s_ty = threadIdx.y;


	// Create shared mem NFC
	__shared__ float s_C[BLOCKSIZE][BLOCKSIZE];

	// @@ ... ms
	s_C[s_ty][s_tx] = 1.0;
	__syncthreads();

	// Sum reduce each column of C
	// @@ ... ms
	for (unsigned int stride = blockDim.y / 2; stride >= 1; stride >>= 1) {
		__syncthreads();
		if (s_ty < stride) {
			s_C[s_ty][s_tx] += s_C[s_ty + stride][s_tx];
		}
	}
	__syncthreads();

	// Save row 0 to global mem
	// @@ ... ms
	if (s_ty == 0) {
		d_col_sum[d_tx] = s_C[0][s_tx];
	}

}

For a test with a 500x500 grid of 16x16-thread blocks on a Quadro FX1800M (compute capability 1.2), I got these results:

Grid setup 500 x 500 blocks :: 16 * 16 threads
Running: !!
Row Sum Time : 1018.276978
Col Sum Time : 390.657013

But whyyyyy???

Any pointers in the right direction would be greatly appreciated.

There is no reason to use shared memory to sum the columns of a (C-style, row-major) matrix.

But to answer your question, the difference lies in the efficiency of use of the available memory access slots, per cycle. Both kernels are loading adjacent elements of memory using adjacent threads, so there is some coalescing behavior in each case.

First, let’s remind ourselves of the warp organization of your 16x16 threadblock, using warp 0 as an example:

warp lane:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ... 31
threadIdx.x: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  0  1     15
threadIdx.y: 0 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  1  1      1

Now let’s consider the row-sum case:

for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
  __syncthreads();
  if (s_tx < stride) {
    s_C[s_ty][s_tx] += s_C[s_ty][s_tx + stride];
  }
}

As we proceed through the for-loop above, we reduce the participating threads from right to left within each warp. At each iteration, only threads with threadIdx.x < stride participate, but the entire warp must remain active as long as it contains even one active thread. Furthermore, each read cycle consumes first 16, then 8, then 4, then 2 of the elements it could deliver, yet the full read cycle must be issued every time.

Now let’s contrast this with the column-sum case:

for (unsigned int stride = blockDim.y / 2; stride >= 1; stride >>= 1) {
  __syncthreads();
  if (s_ty < stride) {
    s_C[s_ty][s_tx] += s_C[s_ty + stride][s_tx];
  }
}

Here, as we iterate through the for-loop, we are reducing the participating threads from the “bottom” up. That means, as stride decreases, entire warps become inactive, and these inactive warps generate no read cycles at all. Furthermore, in each remaining active warp, all 32 threads are active and consuming data on every read cycle.

Therefore the column-sum case maintains full bandwidth utilization (of shared memory in this case) whereas the row-sum case does not.

But you can write a more efficient column sum kernel that does not use shared memory.

If you only care about the final result, it will be quicker to sum all the elements without worrying about rows and columns.

@txbob, Fantastic, thanks for the great explanation. Makes complete sense now.
I also never really thought about not using shared memory. I’m still not sure if this is possible in the actual algorithm, since each kernel performs other calculations before the row/column summations, but I am going to investigate it. I shall come back with more questions on how to implement non-shared-memory column summations if that’s ok.

If you have already loaded shared memory in this fashion (perhaps to do the row-sums, which typically will be faster with a classical parallel reduction in shared memory), then it makes sense to re-use that data in shared memory, and not load it again from global memory. But when considered in isolation, a column-summing kernel will derive no benefit from shared memory. A simple example would be something like this:

template <typename T>
__global__ void column_sum(const T* matrix, const unsigned width, const unsigned height, T* result){

  // grid-stride loop: each thread sums one (or more) whole columns
  unsigned idx = threadIdx.x + blockDim.x * blockIdx.x;
  while (idx < width){
    T my_result = (T)0;
    // adjacent threads read adjacent elements of each row: fully coalesced
    for (unsigned i = 0; i < height; i++)
      my_result += matrix[(i * width) + idx];
    result[idx] = my_result;
    idx += gridDim.x * blockDim.x;
  }
}

This type of column-summing assumes C-style row-major underlying storage.

Algorithm to find the sum of the rows and columns of a matrix

  • Initialize an array rowSum of length m with zeros, to store the sums of the elements of the m rows of the matrix.
  • Initialize an array colSum of length n with zeros, to store the sums of the elements of the n columns of the matrix.
  • Any element A[i][j] of the matrix belongs to the ith row and the jth column, so we add A[i][j] to both running sums: rowSum[i] = rowSum[i] + A[i][j], and colSum[j] = colSum[j] + A[i][j].
  • Once we have traversed the whole matrix, rowSum and colSum hold the row and column sums respectively.

Source : http://www.techcrashcourse.com/2015/03/c-program-sum-each-row-and-column.html