Summation and inter-thread communication


One of the algorithms in the code I am porting to CUDA contains lots of summations. I have looked through the reduction and “scan” examples in the SDK and a PDF file on the subject (also from the CUDA website), but there are still a few things I didn’t quite understand.

For example, in the code I’m porting, one of the operations involves taking the Hadamard product of two matrices (C = A ∘ B, extremely simple in CUDA!), and then “collapsing” the resulting matrix into a vector by summing each column’s elements.
It is the summation over the columns that is not so straightforward. The PDF document I read presented an algorithm for doing a summation, but only over a linear array whose elements were all known up-front.
In my case the elements are only known after each thread performs its own element-wise product.

The only way I could think of to solve this is to store the resulting matrix and then use a separate kernel to do the column sums individually. But this means allocating more storage space (the CPU code needs only the vector), and it means accessing matrix C’s memory twice.
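For concreteness, the CPU version of the whole operation is essentially this (the names and row-major storage here are just illustrative, not the actual code):

```cpp
#include <vector>

// CPU reference: Y[j] = sum over rows i of A(i,j) * B(i,j),
// i.e. the column sums of the Hadamard product C = A o B.
// A and B are stored row-major: element (i,j) lives at A[i * cols + j].
std::vector<float> hadamard_column_sums(const std::vector<float>& A,
                                        const std::vector<float>& B,
                                        int rows, int cols)
{
    std::vector<float> Y(cols, 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            Y[j] += A[i * cols + j] * B[i * cols + j];
    return Y;
}
```

Note that the full matrix C is never materialized here; each product feeds straight into the running column sum, which is exactly what I’d like to keep on the GPU side.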

Is there a way I could communicate the results between threads in the same column to do the summation?


How big is a column in your matrix? If you organize your threads so that each row is a block, and each column is assigned to a different thread in a block, then you could have a kernel which does this:

// Note that I assume A and B are stored row-major: index = row * cols + col

__global__ void multiply_and_sum(int rows, int cols, float *A, float *B, float *output)
{
    extern __shared__ float temp_product[];

    int row = blockIdx.x;
    int col = threadIdx.x;
    int index = row * cols + col;

    temp_product[col] = A[index] * B[index];
    __syncthreads();

    // Reduction in shared memory (assumes cols is a power of two;
    // see the SDK reduction sample for the general case)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (col < stride)
            temp_product[col] += temp_product[col + stride];
        __syncthreads();
    }

    // The sum of all elements is now at index 0
    if (threadIdx.x == 0)
        output[row] = temp_product[0];
}


Note that this kernel uses dynamically sized shared memory, so you have to use the third argument in the kernel call:

multiply_and_sum<<<rows, cols, sizeof(float) * cols>>>(rows, cols, A, B, output);

Note the limits, though: with one thread per column you can’t go past the maximum block size of 512 threads, so wider matrices would need each thread to handle several columns. And at roughly 4000 columns you’d also exhaust the 16 KB of shared memory per block (16 KB / 4 bytes per float ≈ 4096 elements).
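For reference, the halving (“tree”) reduction a block performs in shared memory can be sketched in plain C++ over an ordinary array; serial here, whereas on the GPU each iteration is done by the active threads in parallel with a barrier in between (power-of-two length assumed):

```cpp
#include <cstddef>

// In-place tree reduction: after the loop, data[0] holds the sum of all
// elements.  n must be a power of two (pad with zeros otherwise).
// On the GPU, each outer iteration is executed by `stride` threads,
// followed by a __syncthreads(); here it is just a serial loop.
float tree_reduce(float* data, std::size_t n)
{
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            data[i] += data[i + stride];
    return data[0];
}
```

The point of the pattern is that the sum of n elements finishes in log2(n) parallel steps instead of n serial ones.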

Given matrices A and B of dimension N x K (say N rows, K columns), and assuming A and B are stored column-major:

I declare block = 32 x 8, so one block has 8 warps, corresponding to 8 columns.

Each warp computes partial sums of one column, then collects them.

i.e. supposing a warp computes column j, its CPU version is

[codebox]Y(j) = 0 ; // Y(j) is sum of j-col of C = A o B

for tid = 1:32

    for i = tid:32:N

        Y(j) += A(i,j) * B(i,j)

    end

end[/codebox]
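The same computation in 0-based C, column-major as assumed above (the function name and the 32-wide partial-sum layout are just illustrative of what the warp’s lanes do):

```cpp
#define WARPSIZE 32

// Column sum of A .* B for column j, accumulated the way the 32 lanes of
// a warp would: lane tid handles rows tid, tid+32, tid+64, ...
// A and B are column-major: element (i,j) lives at A[j*N + i].
float column_sum(const float* A, const float* B, int N, int j)
{
    float partial[WARPSIZE];
    for (int tid = 0; tid < WARPSIZE; ++tid) {
        partial[tid] = 0.0f;
        for (int i = tid; i < N; i += WARPSIZE)
            partial[tid] += A[j * N + i] * B[j * N + i];
    }
    // On the GPU this final combine is the shared-memory reduction below.
    float Y = 0.0f;
    for (int tid = 0; tid < WARPSIZE; ++tid)
        Y += partial[tid];
    return Y;
}
```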

kernel is

[codebox]#define BLOCKSIZE_X 32
#define BLOCKSIZE_Y 8

__global__ void hadamard_and_sum(const float *A, const float *B, float *Y, const int N, const int K)
{
    // +16 padding lets the warp reduction read idx+16 without a bounds check;
    // volatile keeps the compiler from caching the warp-synchronous updates
    __shared__ volatile float sdata[BLOCKSIZE_Y][BLOCKSIZE_X + 16];

    int idx = threadIdx.x; // idx = i-1
    int idy = threadIdx.y;
    int colIndex = blockIdx.x * BLOCKSIZE_Y + idy; // colIndex = j-1

    if (colIndex < K) {
        // start index of column colIndex of A and B
        // col-major: (i,j) = (j-1)*N + (i-1)
        int iter = colIndex * N;

        // partial sum of elements A(i,j)*B(i,j),
        // i.e. sum over A(i:BLOCKSIZE_X:N, j) * B(i:BLOCKSIZE_X:N, j)
        sdata[idy][idx] = 0.0f;
        sdata[idy][idx + 16] = 0.0f; // zero the padding
        for (int n = idx; n < N; n += BLOCKSIZE_X) {
            sdata[idy][idx] += A[iter + n] * B[iter + n];
        }

        // collect partial sums (warp-synchronous: each row of sdata is
        // owned by a single warp, so no __syncthreads() is needed)
        sdata[idy][idx] += sdata[idy][idx + 16];
        sdata[idy][idx] += sdata[idy][idx + 8];
        sdata[idy][idx] += sdata[idy][idx + 4];
        sdata[idy][idx] += sdata[idy][idx + 2];
        sdata[idy][idx] += sdata[idy][idx + 1];

        // store back to Y
        if (0 == idx) Y[colIndex] = sdata[idy][0];
    } // if (colIndex < K)
}[/codebox]



The C wrapper is


void hadamard_and_sum_wrapper( float *A_d, float *B_d, float *Y_d, int N, int K )
{
    dim3 blockSize(BLOCKSIZE_X, BLOCKSIZE_Y);

    // each block covers BLOCKSIZE_Y columns
    int nBlocks = K/BLOCKSIZE_Y + (K%BLOCKSIZE_Y == 0 ? 0 : 1);

    // 1-D grid, 2-D thread block
    hadamard_and_sum <<<nBlocks, blockSize>>> (A_d, B_d, Y_d, N, K);
}
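The nBlocks expression is just a ceiling division (enough blocks, each covering a fixed number of columns, to span all K columns); an equivalent one-liner, shown here as a plain helper for illustration:

```cpp
// Ceiling division: smallest number of blocks of size b that cover k items.
// Equivalent to k/b + (k % b == 0 ? 0 : 1).
int div_up(int k, int b)
{
    return (k + b - 1) / b;
}
```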


Thank you both very much for the replies!
I will study this and post back :)

Just one quick question about the C wrapper: does the nvcc compiler automatically create the correct dim3 type for blockSize and nBlocks based on the way you calculated their integer values?