I want to calculate the sum of the 512 rows

Hello guys,

I have an array with 2048 columns and 512 rows. Now I want to calculate the sum of the 512 rows.

Sum = row_0 + row_1 + row_2 + ... + row_511

Here is my code but the result is wrong:

__global__ void Kernel_Baseline_Summation(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{	
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;
	int tidy = blockIdx.y * blockDim.y + threadIdx.y;

	if( (tidx < number_of_columns) && (tidy < number_of_rows) )
	{
		Result[tidx] = Result[tidx] + Input_Data[tidy * number_of_columns + tidx];

		__syncthreads();
	}
	else
	{
		return;
	}
}



dim3 dimGrid;
dim3 dimBlock;

dimBlock.x = 32;
dimBlock.y = 1;

dimGrid.x = number_of_columns / 32;
dimGrid.y = number_of_rows;

Kernel_Baseline_Summation<<<dimGrid, dimBlock>>>(Input_Data, Result, number_of_columns, number_of_rows);
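The likely reason this version gives a wrong result is a race condition: every block along the y-direction does an unsynchronized read-modify-write of the same Result[tidx], so updates from different rows overwrite each other (and Result also has to be zeroed before the launch). A minimal sketch of one possible fix using atomicAdd (the _Atomic name is only for illustration); note that atomicAdd on double requires a GPU of compute capability 6.0 or newer, and Result must still be initialized to zero first:

__global__ void Kernel_Baseline_Summation_Atomic(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;
	int tidy = blockIdx.y * blockDim.y + threadIdx.y;

	if( (tidx < number_of_columns) && (tidy < number_of_rows) )
	{
		// each thread adds one element of its row to the column sum;
		// atomicAdd serializes the concurrent updates to Result[tidx]
		atomicAdd(&Result[tidx], (double)Input_Data[tidy * number_of_columns + tidx]);
	}
}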

I have modified my code. Now the calculation is correct, but the performance is very slow :-(

I need 0.25 ms for 2048x512 values.

Can somebody help me optimize my code with shared memory, please?

One thread calculates the sum of one column. Here is my code:

__global__ void Kernel_Baseline_Summation(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{	
	unsigned short i;

	int tidx = blockIdx.x * blockDim.x + threadIdx.x;

	if( tidx < number_of_columns)
	{
		for(i = 0; i < number_of_rows; i++)
		{
			Result[tidx] = Result[tidx] + Input_Data[i * number_of_columns + tidx];	
		}

		__syncthreads();
	}
	else
	{
		return;
	}
} 


Kernel_Baseline_Summation<<<dimGrid, dimBlock>>>(Input_Data, Result, number_of_columns, number_of_rows);
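A quick optimization that does not need shared memory yet (just a sketch, not from the thread): accumulate the column sum in a register instead of reading and writing Result[tidx] in global memory on every loop iteration, and only write the final value. It is launched one-dimensionally with one thread per column (e.g. number_of_columns / 32 blocks of 32 threads):

__global__ void Kernel_Baseline_Summation_Reg(unsigned short *Input_Data, double *Result, int number_of_columns, int number_of_rows)
{
	int tidx = blockIdx.x * blockDim.x + threadIdx.x;

	if( tidx < number_of_columns )
	{
		double sum = 0.0;   // per-thread accumulator kept in a register

		for(int i = 0; i < number_of_rows; i++)
		{
			sum += Input_Data[i * number_of_columns + tidx];
		}

		Result[tidx] = sum;   // single global write per column
		// for the average mentioned further down: Result[tidx] = sum / number_of_rows;
	}
}

Neighbouring threads read neighbouring columns in each iteration, so the global loads are already coalesced.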

Line 29 looks funny. Something is wrong with the code formatting on the forum.

I hope I got it right…

You have input[512][2048] and you want output[512] with output[i] = sum_j(input[i][j]).

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[2048]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 2048;               // start of this block's row
	temp[tid]        = input[ii + tid];
	temp[tid + 512]  = input[ii + 512 + tid];
	temp[tid + 1024] = input[ii + 1024 + tid];
	temp[tid + 1536] = input[ii + 1536 + tid];

	// each thread folds its four elements into one partial sum
	temp[tid] = temp[tid] + temp[tid + 512] + temp[tid + 1024] + temp[tid + 1536];
	__syncthreads();

	// tree reduction over the 512 partial sums
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<512, 512>>>(input, output);

This will work for input[512][2048] and output[512]; the kernel is launched with 512 blocks of 512 threads each (for other grid configurations the kernel needs to be changed).

This is too complicated for me :-)

On Monday I will get some books on CUDA programming.

You can still check if this code works for your data.

There is a mistake in the code.

Is temp the shared variable?

I have tested your code. The result is wrong, but the code is faster :-)

In the future I will calculate the average of this array (Result = (row 0 + row 1 + ... + row n) / n).

Here is the code:

__global__ void Kernel_Calculate_New_Baseline(unsigned short *Input_Data, double *Result)
{
	int tid=threadIdx.x;
	__shared__ double temp[2048]; // 'static' shared memory declaration

	int ii;
	ii=blockIdx.x*2048;
	temp[tid]=Input_Data[ii+tid];
	temp[tid+512]=Input_Data[ii+512+tid];
	temp[tid+1024]=Input_Data[ii+1024+tid];
	temp[tid+1536]=Input_Data[ii+1536+tid];

	temp[tid]=temp[tid]+temp[tid+512]+temp[tid+1024]+temp[tid+1536];
	__syncthreads();

	for(unsigned int s=256; s>=1; s=s/2)
	{
		if(tid< s)
		{
			temp[tid] += temp[tid + s];
		}

		__syncthreads();
	}

	if(tid == 0) Result[blockIdx.x] = temp[0];
}

Kernel_Calculate_New_Baseline<<<512, 512>>>(Input_Data, Result);

Now fixed. I am surprised it gives the wrong result; it is very similar to other code I use for summations.

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[2048]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 2048;               // start of this block's row
	temp[tid]        = input[ii + tid];
	temp[tid + 512]  = input[ii + 512 + tid];
	temp[tid + 1024] = input[ii + 1024 + tid];
	temp[tid + 1536] = input[ii + 1536 + tid];

	// each thread folds its four elements into one partial sum
	temp[tid] = temp[tid] + temp[tid + 512] + temp[tid + 1024] + temp[tid + 1536];
	__syncthreads();

	// tree reduction over the 512 partial sums
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<512, 512>>>(input, output);

For the last step you obtain a vector with 512 elements which can be reduced in the same way, but now only with 1 block.
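A minimal sketch of that last step, following the same pattern with a single block of 512 threads (the names partial and total are only for illustration):

__global__ void Final_Reduction(int *partial, int *total)
{
	int tid = threadIdx.x;
	__shared__ int temp[512];

	temp[tid] = partial[tid];
	__syncthreads();

	// same tree reduction, but over a single block
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) total[0] = temp[0];
}

// in the program use

Final_Reduction<<<1, 512>>>(partial, total);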

Is it possible that you add the wrong values?

Result[0] = Input_Data[0] + Input_Data[2048] + Input_Data[4096] + ... + Input_Data[n * 2048]
Result[1] = Input_Data[1] + Input_Data[2048 + 1] + Input_Data[4096 + 1] + ... + Input_Data[n * 2048 + 1]

The code I posted is for a matrix input[512][2048] and it gives the sum along the second index output[512]. Is your problem different?

[attached image]

I have tested the kernel with constant input data.

My matrix input is [512][2048] too.

All output values have the same value.

if(tid == 0) Result[blockIdx.x] = temp[0];

If the input is a matrix input[2048][512] and output[2048], the code is a little different:

__global__ void Baseline_Summation(int *input, int *output)
{
	int tid = threadIdx.x;
	__shared__ int temp[512]; // 'static' shared memory declaration
	int ii;

	ii = blockIdx.x * 512;        // start of this block's row of 512 elements
	temp[tid] = input[ii + tid];
	__syncthreads();

	// tree reduction over the 512 values of this row
	for(unsigned int s = 256; s >= 1; s = s / 2)
	{
		if(tid < s)
		{
			temp[tid] += temp[tid + s];
		}
		__syncthreads();
	}

	if(tid == 0) output[blockIdx.x] = temp[0];
}

// in the program use

Baseline_Summation<<<2048, 512>>>(input, output);

There are 2048 blocks with 512 threads per block.

All Outputs have the same value

if(tid == 0) output[blockIdx.x] = temp[0];

Well, I think I misunderstood your question, but the code should work if the data is properly put in the matrix. The line you highlighted does not mean that the output is the same: the temp array is shared and local to each block, and temp[0] collects the sum within each block. There are 2048 blocks, and temp[0] will be different (local) in each block.

But you can see in my picture ALL RESULTS have the same value :-)

The code does not work for your problem, but it is correct.
It is important how you construct the input array. In my examples I map the 2D matrix to a 1D array: the element 2dinput[i][j] is stored in 1dinput[j + i*nj], where i = 0 ... ni-1 and j = 0 ... nj-1. Both examples I posted sum along the second dimension, but in one case ni=512, nj=2048, while in the other case ni=2048, nj=512. I am always confused about what people mean when referring to the rows/lines of a matrix.
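A small host-side sketch of that mapping (ni, nj, input2d and input1d are only illustrative names):

/* flatten a 2D matrix with ni rows and nj columns into a 1D array,
   so that input2d[i][j] ends up at input1d[j + i * nj] */
for(int i = 0; i < ni; i++)
{
	for(int j = 0; j < nj; j++)
	{
		input1d[j + i * nj] = input2d[i][j];
	}
}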

You need to state your problem in a concise manner. The reduction problem is quite straightforward and you will be able to implement it quite easily.