I want to calculate the sum of the 512 rows

Hello guys,

I have an array with 2048 columns and 512 rows. Now I want to calculate the sum of the 512 rows.

Sum = row_0 + row_1 + row_2 + ... + row_511

Here is my code but the result is wrong:

__global__ void Kernel_Baseline_Summation(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{	
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;
	int tidy = blockIdx.y * blockDim.y + threadIdx.y;

	if( (tidx < number_of_columns) && (tidy < number_of_rows) )
	{
		Result[tidx] = Result[tidx] + Input_Data[tidy * number_of_columns + tidx];

		__syncthreads();
	}
	else
	{
		return;
	}
}



dim3 dimGrid;
dim3 dimBlock;

dimBlock.x = 32;
dimBlock.y = 1;

dimGrid.x = number_of_columns / 32;
dimGrid.y = number_of_rows;

Kernel_Baseline_Summation<<<dimGrid, dimBlock>>>(Input_Data, Result, number_of_columns, number_of_rows);
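The likely reason this version gives a wrong result is a race condition: every block along the y-direction does an unsynchronized read-modify-write of the same Result[tidx], so updates from different rows overwrite each other (and Result also has to be zeroed before the launch). A minimal sketch of one possible fix using atomicAdd (the _Atomic name is only for illustration); note that atomicAdd on double requires a GPU of compute capability 6.0 or newer, and Result must still be initialized to zero first:

__global__ void Kernel_Baseline_Summation_Atomic(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;
	int tidy = blockIdx.y * blockDim.y + threadIdx.y;

	if( (tidx < number_of_columns) && (tidy < number_of_rows) )
	{
		// each thread adds one element of its row to the column sum;
		// atomicAdd serializes the concurrent updates to Result[tidx]
		atomicAdd(&Result[tidx], (double)Input_Data[tidy * number_of_columns + tidx]);
	}
}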

I have modified my code. Now the calculation is correct, but the performance is very slow :-(

I need 0.25 ms for 2048x512 values.

Can somebody help me optimize my code with shared memory, please?

One thread calculates the sum of one column. Here is my code:

__global__ void Kernel_Baseline_Summation(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{	
	unsigned short i;

	int tidx = blockIdx.x * blockDim.x + threadIdx.x;

	if( tidx < number_of_columns)
	{
		for(i = 0; i < number_of_rows; i++)
		{
			Result[tidx] = Result[tidx] + Input_Data[i * number_of_columns + tidx];	
		}

		__syncthreads();
	}
	else
	{
		return;
	}
} 


Kernel_Baseline_Summation<<<dimGrid, dimBlock>>>(Input_Data, Result, number_of_columns, number_of_rows);
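A quick optimization that does not need shared memory yet (just a sketch, not from the thread): accumulate the column sum in a register instead of reading and writing Result[tidx] in global memory on every loop iteration, and only write the final value. It is launched one-dimensionally with one thread per column (e.g. number_of_columns / 32 blocks of 32 threads):

__global__ void Kernel_Baseline_Summation_Reg(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;

	if( tidx < number_of_columns )
	{
		double sum = 0.0;   // per-thread accumulator kept in a register

		for(int i = 0; i < number_of_rows; i++)
		{
			sum += Input_Data[i * number_of_columns + tidx];
		}

		Result[tidx] = sum;   // single global write per column
		// for the average mentioned further down: Result[tidx] = sum / number_of_rows;
	}
}

Neighbouring threads read neighbouring columns in each iteration, so the global loads are already coalesced.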

Line 29 looks funny. Something is wrong with the code formatting on the forum.

I hope I got it right…

You have input[512][2048] and you want output[512] with output[i] = sum_j(input[i][j]).

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[2048]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 2048;               // start of this block's row
	temp[tid]        = input[ii + tid];
	temp[tid + 512]  = input[ii + 512 + tid];
	temp[tid + 1024] = input[ii + 1024 + tid];
	temp[tid + 1536] = input[ii + 1536 + tid];

	// each thread folds its four elements into one partial sum
	temp[tid] = temp[tid] + temp[tid + 512] + temp[tid + 1024] + temp[tid + 1536];
	__syncthreads();

	// tree reduction over the 512 partial sums
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<512, 512>>>(input, output);

This will work for input[512][2048] and output[512]; the kernel is launched with 512 blocks of 512 threads each (for other grid configurations the kernel needs to be changed).

This is too complicated for me :-)

On Monday I will get some books on CUDA programming.

You can still check if this code works for your data.

There is a mistake in the code.

Is temp the shared variable?

I have tested your code. The result is wrong, but the code is faster :-)

In the future I will calculate the average of this array (Result = (row 0 + row 1 + ... + row n) / n).

Here is the code:

__global__ void Kernel_Calculate_New_Baseline(unsigned short *Input_Data, double *Result)
{
	int tid=threadIdx.x;
	__shared__ double temp[2048]; // 'static' shared memory declaration

	int ii;
	ii=blockIdx.x*2048;
	temp[tid]=Input_Data[ii+tid];
	temp[tid+512]=Input_Data[ii+512+tid];
	temp[tid+1024]=Input_Data[ii+1024+tid];
	temp[tid+1536]=Input_Data[ii+1536+tid];

	temp[tid]=temp[tid]+temp[tid+512]+temp[tid+1024]+temp[tid+1536];
	__syncthreads();

	for(unsigned int s=256; s>=1; s=s/2)
	{
		if(tid< s)
		{
			temp[tid] += temp[tid + s];
		}

		__syncthreads();
	}

	if(tid == 0) Result[blockIdx.x] = temp[0];
}

Kernel_Calculate_New_Baseline<<<512, 512>>>(Input_Data, Result);

Now fixed. I am surprised it gives the wrong result; it is very similar to other code I use for summations.

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[2048]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 2048;               // start of this block's row
	temp[tid]        = input[ii + tid];
	temp[tid + 512]  = input[ii + 512 + tid];
	temp[tid + 1024] = input[ii + 1024 + tid];
	temp[tid + 1536] = input[ii + 1536 + tid];

	// each thread folds its four elements into one partial sum
	temp[tid] = temp[tid] + temp[tid + 512] + temp[tid + 1024] + temp[tid + 1536];
	__syncthreads();

	// tree reduction over the 512 partial sums
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<512, 512>>>(input, output);

For the last step you obtain a vector with 512 elements which can be reduced in the same way, but now only with 1 block.
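A minimal sketch of that last step, following the same pattern with a single block of 512 threads (the names partial and total are only for illustration):

__global__ void Final_Reduction(int *partial, int *total)
{
	int tid = threadIdx.x;
	__shared__ int temp[512];

	temp[tid] = partial[tid];
	__syncthreads();

	// same tree reduction, but over a single block
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) total[0] = temp[0];
}

// in the program use

Final_Reduction<<<1, 512>>>(partial, total);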

Is it possible that you add the wrong values?

Result[0] = Input_Data[0] + Input_Data[2048] + Input_Data[4096] + ... + Input_Data[n * 2048]
Result[1] = Input_Data[1] + Input_Data[2048 + 1] + Input_Data[4096 + 1] + ... + Input_Data[n * 2048 + 1]

The code I posted is for a matrix input[512][2048] and it gives the sum along the second index output[512]. Is your problem different?

[attached image]

I have tested the kernel with constant input data.

My matrix input is [512][2048] too.

All output values have the same value.

if(tid == 0) Result[blockIdx.x] = temp[0];

If the input is a matrix input[2048][512] and output[2048], the code is a little different:

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[512]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 512;        // start of this block's row of 512 elements
	temp[tid] = input[ii + tid];
	__syncthreads();

	// tree reduction over the 512 values of this row
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<2048, 512>>>(input, output);

There are 2048 blocks with 512 threads per block.

All Outputs have the same value

if(tid == 0) output[blockIdx.x] = temp[0];

Well, I think I misunderstood your question, but the code should work if the data is properly put in the matrix. The line you highlighted does not mean that the output is the same: the temp array is shared and local to each block, and temp[0] collects the sum within each block. There are 2048 blocks, and temp[0] will be different (local) in each block.

But you can see in my picture ALL RESULTS have the same value :-)

The code does not work for your problem, but it is correct.
It is important how you construct the input array. In my examples I map the 2D matrix to a 1D array: the element 2dinput[i][j] is stored in 1dinput[j + i*nj], where i = 0 ... ni-1 and j = 0 ... nj-1. Both examples I posted sum along the second dimension, but in one case ni=512, nj=2048, while in the other case ni=2048, nj=512. I am always confused about what people mean when referring to the rows/lines of a matrix.
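A small host-side sketch of that mapping (ni, nj, input2d and input1d are only illustrative names):

/* flatten a 2D matrix with ni rows and nj columns into a 1D array,
   so that input2d[i][j] ends up at input1d[j + i * nj] */
for(int i = 0; i < ni; i++)
{
	for(int j = 0; j < nj; j++)
	{
		input1d[j + i * nj] = input2d[i][j];
	}
}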

You need to state your problem in a concise manner. The reduction problem is quite straightforward and you will be able to implement it quite easily.