Hi everyone,

I have a very simple problem, but I can’t find an efficient way to solve it with CUDA.

For a small matrix, say 100x100, I want to compute the sum of all its elements. In my test it's actually just a vector containing the values 0 to 9999.

My first idea was to give each block one row: some threads would accumulate the sum for that row in shared memory, and then a single thread would sum the shared-memory vector … but I don't think that would really be faster than doing the same thing on the CPU …

Another idea is to do a reduction up to a given point: each thread adds 2 elements, halving the count at every step, until the number of elements is no longer even. But in some cases (an odd count right away) that would do almost nothing, so …

I look forward to seeing your solutions!

Thank you in advance.

PS: actually I want to do this for a sub-matrix of a bigger matrix, but I think the problem remains the same.

EDIT: I tried the following, but it's slower :(

```
__global__ void mean_calculation_kernel(float* d_Data, int data_size, float* blocktmp)
{
    const int threadsPerBlock = 512;
    __shared__ float cache[threadsPerBlock];

    int offset = threadIdx.x + blockIdx.x * blockDim.x;

    // Load one element per thread; pad out-of-range threads with 0 so the
    // reduction below never reads uninitialized shared memory.
    cache[threadIdx.x] = (offset < data_size) ? d_Data[offset] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    // Note: __syncthreads() must be reached by ALL threads of the block, so
    // it cannot sit inside the if(offset < data_size) branch.
    for (int i = blockDim.x / 2; i != 0; i /= 2)
    {
        if (threadIdx.x < i)
        {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (threadIdx.x == 0)
    {
        blocktmp[blockIdx.x] = cache[0];
    }
}
```

with

```
float mean_calculation(float* d_Data, int data_size)
{
    int T = 512;                           // threads per block (power of two)
    const int B = (data_size + T - 1) / T; // number of blocks

    float mean = 0.0f;
    float *h_mean = (float *)malloc(B * sizeof(float));
    float *blocktmp;
    cutilSafeCall(cudaMalloc((void**)&blocktmp, B * sizeof(float)));

    mean_calculation_kernel<<<B, T>>>(d_Data, data_size, blocktmp);

    // Copy all B per-block partial sums back to the host.
    cutilSafeCall(cudaMemcpy(h_mean, blocktmp, B * sizeof(float), cudaMemcpyDeviceToHost));

    // Final reduction of the partial sums on the CPU.
    for (int i = 0; i < B; i++)
    {
        mean += h_mean[i];
    }

    cutilSafeCall(cudaFree(blocktmp));
    free(h_mean);
    return mean / data_size;
}
```