Hello MissSusan,
I tried to compute the sums of rows in a matrix, and I think that problem is similar to yours.
My first attempt at realising your idea would be to pick a kernel configuration that avoids having to search for your IDs.
A basic approach is to use the built-in variable blockIdx.y as the ID for a reduction (if you have fewer than 65536 IDs) and to sum all elements with the corresponding ID.
That way, you can sum up 65536 x 512 elements in every blockIdx.y row.
This is the host code and the kernel configuration I would use…
...
// n is the number of elements
int n = ...
dim3 sum_DimBlock, sum_DimGrid;
// maxThreads is the maximum number of threads you want to use or that your device can deliver
// we only need n / 2 threads in the first reduction pass to sum n elements
sum_DimBlock.x = ( n < maxThreads << 1 ) ? (int)ceil( n / 2.0 ) : maxThreads;
sum_DimGrid.x = (int)ceil( n / (float)( sum_DimBlock.x << 1 ));
sum_DimGrid.y = maximumID;
sums<<< sum_DimGrid, sum_DimBlock, sizeof( float ) * sum_DimBlock.x>>>( n );
while( sum_DimGrid.x > 1 )
{
n = sum_DimGrid.x;
sum_DimBlock.x = ( n < maxThreads << 1 ) ? (int)ceil( n / 2.0 ) : maxThreads;
sum_DimGrid.x = (int)ceil( n / (float)( sum_DimBlock.x << 1 ));
sums<<< sum_DimGrid, sum_DimBlock, sizeof( float ) * sum_DimBlock.x >>>( n );
}
...
The sums kernel can be something like:
__global__ void sums( int n )
{
// array in shared memory, use kernel configuration to specify the used memory
extern __shared__ float T_s[];
unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int ixb = ix + gridDim.x * blockDim.x;
// load id from global memory to a register to avoid multiple global mem reads
// the following ..._cPtr arrays are pointers in constant memory ( so you don't need to pass them with every kernel call ) that point to arrays in global memory
float actualId_r = id_array_cPtr[ ix ];
unsigned int actualIndex_r = index_array_cPtr[ ix ];
float secondId_r = ( ixb < n ? id_array_cPtr[ ixb ] : -1 );
unsigned int secondIndex_r = ( ixb < n ? index_array_cPtr[ ixb ] : 0 );
// check, if the elements have the right ID
T_s[ threadIdx.x ] = ( actualId_r == blockIdx.y ? actualIndex_r : 0 ) + ( secondId_r == blockIdx.y ? secondIndex_r : 0 );
__syncthreads();
// sum all elements within the block
unsigned int s_old = blockDim.x;
for( int s = blockDim.x >> 1; s > 0; s >>= 1 )
{
if( threadIdx.x < s )
{
T_s[ threadIdx.x ] += T_s[ threadIdx.x + s ];
if( threadIdx.x == 0 && s_old != s << 1 ) // if the previous size was odd, add the leftover element
{
T_s[ 0 ] += T_s[ s_old - 1 ];
}
}
s_old = s;
__syncthreads();
}
if( threadIdx.x == 0 )
{
index_array_cPtr[ blockIdx.y ] = blockIdx.y; // that is a bad idea, should not work... choose another location in the source array or use a buffer array
id_array_cPtr[ blockIdx.y ] = T_s[0];
}
}
I did not test this code; it is just an idea. I used a similar one to sum matrix rows. It works for me on a Tesla and a Quadro device, but failed at higher dimensions on a GeForce 9800 GT under openSUSE 11 (this problem is still unsolved, but I think it is a driver problem).
Probably this code is not very efficient. You could use a buffer array to avoid searching for the right IDs after the first kernel run (the results are located at the “beginning of a matrix line”).
The write-back to global memory is wrong, as far as I can see at the moment. But I’m out of time, sorry ^^
Good Luck