Summation and inter-thread communication


One of the algorithms in the code I am porting to CUDA contains lots of summations. I have looked through the reduction and “scan” examples in the SDK and a PDF file on the subject (also from the CUDA website), but there are still a few things I didn’t quite understand.

For example, in the code I’m porting, one of the operations involves taking the Hadamard product of two matrices (C = A ∘ B, extremely simple in CUDA!), and then “collapsing” the resulting matrix into a vector by summing each column’s elements.
It is the summation over the columns that is not so straightforward. The PDF document I read presented an algorithm for doing a summation, but only over a linear array whose elements were all known up-front.
In my case the elements are only known after each thread performs its own element-wise product.

The only way I could think of to solve this is to store the resulting matrix and then use a separate kernel to do the column sums individually. But this means allocating more storage space (the CPU code needs only the vector), and it means accessing matrix C’s memory twice.
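For concreteness, the CPU version of the whole operation is essentially this (the names and row-major storage here are just illustrative, not the actual code):

```cpp
#include <vector>

// CPU reference: Y[j] = sum over rows i of A(i,j) * B(i,j),
// i.e. the column sums of the Hadamard product C = A o B.
// A and B are stored row-major: element (i,j) lives at A[i * cols + j].
std::vector<float> hadamard_column_sums(const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        int rows, int cols)
{
    std::vector<float> Y(cols, 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            Y[j] += A[i * cols + j] * B[i * cols + j];
    return Y;
}
```

Note that the full matrix C is never materialized here; each product feeds straight into the running column sum, which is exactly what I’d like to keep on the GPU side.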

Is there a way I could communicate the results between threads in the same column to do the summation?


How big is a column in your matrix? If you organize your threads so that each row is a block, and each column is assigned to a different thread in a block, then you could have a kernel which does this:

// Note that I assume A and B are stored row-major: index = row * cols + col

__global__ void multiply_and_sum(int rows, int cols, float *A, float *B, float *output)
{
    extern __shared__ float temp_product[];

    int row = blockIdx.x;
    int col = threadIdx.x;
    int index = row * cols + col;

    temp_product[col] = A[index] * B[index];
    __syncthreads();

    // Reduction in shared memory (assumes cols is a power of two;
    // see the SDK reduction sample for the general case)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (col < stride)
            temp_product[col] += temp_product[col + stride];
        __syncthreads();
    }

    // The sum of all elements is now at index 0
    if (threadIdx.x == 0)
        output[row] = temp_product[0];
}


Note that this kernel uses dynamically sized shared memory, so you have to use the third argument in the kernel call:

multiply_and_sum<<<rows, cols, sizeof(float) * cols>>>(rows, cols, A, B, output);

Note the limits, though: with one thread per column you can’t go past the maximum block size of 512 threads, so wider matrices would need each thread to handle several columns. And at roughly 4000 columns you’d also exhaust the 16 KB of shared memory per block (16 KB / 4 bytes per float ≈ 4096 elements).
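For reference, the halving (“tree”) reduction a block performs in shared memory can be sketched in plain C++ over an ordinary array; serial here, whereas on the GPU each iteration is done by the active threads in parallel with a barrier in between (power-of-two length assumed):

```cpp
#include <cstddef>

// In-place tree reduction: after the loop, data[0] holds the sum of all
// elements.  n must be a power of two (pad with zeros otherwise).
// On the GPU, each outer iteration is executed by `stride` threads,
// followed by a __syncthreads(); here it is just a serial loop.
float tree_reduce(float* data, std::size_t n)
{
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];
    return data[0];
}
```

The point of the pattern is that the sum of n elements finishes in log2(n) parallel steps instead of n serial ones.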

Given matrices A and B of dimension N x K (say N rows, K columns), and assuming A and B are stored column-major:

I declare block = 32 x 8, so one block has 8 warps, corresponding to 8 columns.

Each warp computes partial sums of one column, then collects them.

i.e. supposing a warp computes column j, its CPU version is

[codebox]Y(j) = 0 ; // Y(j) is sum of j-col of C = A o B

for tid = 1:32

    for i = tid:32:N

        Y(j) += A(i,j) * B(i,j)

    end

end[/codebox]
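The same computation in 0-based C, column-major as assumed above (the function name and the 32-wide partial-sum layout are just illustrative of what the warp’s lanes do):

```cpp
#define WARPSIZE 32

// Column sum of A .* B for column j, accumulated the way the 32 lanes of
// a warp would: lane tid handles rows tid, tid+32, tid+64, ...
// A and B are column-major: element (i,j) lives at A[j*N + i].
float column_sum(const float* A, const float* B, int N, int j)
{
    float partial[WARPSIZE];
    for (int tid = 0; tid < WARPSIZE; ++tid) {
        partial[tid] = 0.0f;
        for (int i = tid; i < N; i += WARPSIZE)
            partial[tid] += A[j * N + i] * B[j * N + i];
    }
    // On the GPU this final combine is the shared-memory reduction below.
    float Y = 0.0f;
    for (int tid = 0; tid < WARPSIZE; ++tid)
        Y += partial[tid];
    return Y;
}
```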

kernel is

[codebox]#define BLOCKSIZE_X 32
#define BLOCKSIZE_Y 8

__global__ void hadamard_and_sum(const float *A, const float *B, float *Y, const int N, const int K)
{
    // +16 padding lets the warp reduction read idx+16 without a bounds check;
    // volatile keeps the compiler from caching the warp-synchronous updates
    __shared__ volatile float sdata[BLOCKSIZE_Y][BLOCKSIZE_X + 16];

    int idx = threadIdx.x; // idx = i-1
    int idy = threadIdx.y;
    int colIndex = blockIdx.x * BLOCKSIZE_Y + idy; // colIndex = j-1

    if (colIndex < K) {
        // start index of column colIndex of A and B
        // col-major: (i,j) = (j-1)*N + (i-1)
        int iter = colIndex * N;

        // partial sum of elements A(i,j)*B(i,j),
        // i.e. sum over A(i:BLOCKSIZE_X:N, j) * B(i:BLOCKSIZE_X:N, j)
        sdata[idy][idx] = 0.0f;
        sdata[idy][idx + 16] = 0.0f; // zero the padding
        for (int n = idx; n < N; n += BLOCKSIZE_X) {
            sdata[idy][idx] += A[iter + n] * B[iter + n];
        }

        // collect partial sums (warp-synchronous: each row of sdata is
        // owned by a single warp, so no __syncthreads() is needed)
        sdata[idy][idx] += sdata[idy][idx + 16];
        sdata[idy][idx] += sdata[idy][idx + 8];
        sdata[idy][idx] += sdata[idy][idx + 4];
        sdata[idy][idx] += sdata[idy][idx + 2];
        sdata[idy][idx] += sdata[idy][idx + 1];

        // store back to Y
        if (0 == idx) Y[colIndex] = sdata[idy][0];
    } // if (colIndex < K)
}[/codebox]



The C wrapper is


void hadamard_and_sum_wrapper( float *A_d, float *B_d, float *Y_d, int N, int K )
{
    dim3 blockSize(BLOCKSIZE_X, BLOCKSIZE_Y);

    // each block covers BLOCKSIZE_Y columns
    int nBlocks = K/BLOCKSIZE_Y + (K%BLOCKSIZE_Y == 0 ? 0 : 1);

    // 1-D grid, 2-D thread block
    hadamard_and_sum <<<nBlocks, blockSize>>> (A_d, B_d, Y_d, N, K);
}
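The nBlocks expression is just a ceiling division (enough blocks, each covering a fixed number of columns, to span all K columns); an equivalent one-liner, shown here as a plain helper for illustration:

```cpp
// Ceiling division: smallest number of blocks of size b that cover k items.
// Equivalent to k/b + (k % b == 0 ? 0 : 1).
int div_up(int k, int b)
{
    return (k + b - 1) / b;
}
```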


Thank you both very much for the replies!
I will study this and post back :)

Just one quick question about the C wrapper: does the nvcc compiler automatically create the correct dim3 type for blockSize and nBlocks based on the way you calculated their integer values?