Line 29 looks odd; something is wrong with the code formatting on the forum.
I hope I got it right…
You have input[512][2048] and you want output[512] with output[i] = sum_j input[i][j].
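To check GPU results against something simple, a plain host-side version of the same sum could look like the sketch below. The name row_sums_cpu and the flat row-major layout (element [i][j] at i*nj + j) are my assumptions for illustration, not something taken from your code.

```cuda
#include <vector>

// Hypothetical CPU reference: output[i] = sum over j of input[i][j].
// Assumes the matrix is stored flat, row by row: input[i][j] -> input[i * nj + j].
std::vector<int> row_sums_cpu(const std::vector<int>& input, int ni, int nj)
{
    std::vector<int> output(ni, 0);
    for (int i = 0; i < ni; ++i) {
        for (int j = 0; j < nj; ++j) {
            output[i] += input[i * nj + j];
        }
    }
    return output;
}
```

For the sizes in this thread you would call it as row_sums_cpu(h_input, 512, 2048) and compare element by element with what the kernel returns.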
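__global__ void Baseline_Summation(int *input, int *output)
{
    int tid = threadIdx.x;
    __shared__ int temp[2048]; // 'static' shared memory declaration
    int ii = blockIdx.x * 2048; // start of this block's row in the flat input
    // each of the 512 threads loads four elements of the row
    temp[tid] = input[ii + tid];
    temp[tid + 512] = input[ii + 512 + tid];
    temp[tid + 1024] = input[ii + 1024 + tid];
    temp[tid + 1536] = input[ii + 1536 + tid];
    // first step: 2048 values -> 512 partial sums (each thread reads only what it wrote)
    temp[tid] = temp[tid] + temp[tid + 512] + temp[tid + 1024] + temp[tid + 1536];
    __syncthreads();
    // tree reduction of the 512 partial sums
    for (unsigned int s = 256; s >= 1; s = s / 2)
    {
        if (tid < s)
        {
            temp[tid] += temp[tid + s];
        }
        __syncthreads();
    }
    // thread 0 writes the sum of this block's row
    if (tid == 0) output[blockIdx.x] = temp[0];
}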
// in the program use
Baseline_Summation<<<512, 512>>>(input, output);
This works for input[512][2048] and output[512]. The kernel is launched with 512 blocks of 512 threads each; for other grid configurations the kernel needs to be changed.
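If you do need other sizes, one possible way to change the kernel is sketched below. This is not the kernel above, just a variant under assumptions of mine: one block per row, a block size that is a power of two, and the row length passed in as nj. The names row_sum_any_width and row_sums_gpu are made up for the example, and error checking of the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical generalization: one block sums one row of length nj.
// Assumes blockDim.x is a power of two.
__global__ void row_sum_any_width(const int *input, int *output, int nj)
{
    extern __shared__ int temp[];   // blockDim.x ints, sized at launch
    unsigned int tid = threadIdx.x;
    const int *row = input + (size_t)blockIdx.x * nj;

    // Each thread first sums a strided slice of the row in a register.
    int acc = 0;
    for (int j = threadIdx.x; j < nj; j += blockDim.x) {
        acc += row[j];
    }
    temp[tid] = acc;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            temp[tid] += temp[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = temp[0];
}

// Hypothetical host helper: copies the flat ni x nj matrix to the device,
// launches one block per row and returns the ni row sums.
std::vector<int> row_sums_gpu(const std::vector<int>& h_in, int ni, int nj,
                              int threads = 256)
{
    int *d_in = nullptr;
    int *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, ni * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    row_sum_any_width<<<ni, threads, threads * sizeof(int)>>>(d_in, d_out, nj);

    std::vector<int> h_out(ni);
    cudaMemcpy(h_out.data(), d_out, ni * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this version the 512 x 2048 case would be row_sums_gpu(h_input, 512, 2048, 512), and the transposed sizes only change the arguments, not the kernel.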
Well, I think I misunderstood your question, but the code should work if the data is laid out properly in the matrix. The line you highlighted does not mean that the output is the same for every block. The temp array is in shared memory, which is local to each block, and temp[0] collects that block's sum. There are 2048 blocks, and each one has its own temp[0].
The code does not work for your problem, but it is correct.
How you construct the input array matters. In my examples I map the 2D matrix to a 1D array: the element 2dinput[i][j] is stored in 1dinput[j+i*nj], where i = 0 → ni-1 and j = 0 → nj-1. Both examples I posted sum along the second dimension, but in one case ni=512, nj=2048, while in the other ni=2048, nj=512. I am always confused about what one means when referring to the rows/lines of a matrix.
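To make the mapping concrete, here is a small sketch of how a 2D matrix could be packed into the 1D array the kernels expect. The helper names flat_index and flatten are hypothetical, and the nested std::vector is only a stand-in for however you hold the 2D data on the host.

```cuda
#include <vector>

// Row-major position of element (i, j) in a matrix with nj columns.
// Marked __host__ __device__ so the same formula can be used inside a kernel.
__host__ __device__ inline int flat_index(int i, int j, int nj)
{
    return j + i * nj;
}

// Hypothetical helper: copy a 2D matrix into a flat 1D array, row after row.
// Assumes every row of m has the same length.
std::vector<int> flatten(const std::vector<std::vector<int>>& m)
{
    int ni = static_cast<int>(m.size());
    int nj = ni > 0 ? static_cast<int>(m[0].size()) : 0;
    std::vector<int> flat(static_cast<size_t>(ni) * nj);
    for (int i = 0; i < ni; ++i) {
        for (int j = 0; j < nj; ++j) {
            flat[flat_index(i, j, nj)] = m[i][j];
        }
    }
    return flat;
}
```

With this layout, "summing along the second dimension" means each block walks nj consecutive elements starting at i*nj, which is what the ii offset in the kernel computes.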
You need to state your problem concisely. The reduction problem is quite straightforward, and you will be able to implement it quite easily.