Hi everyone,

I have a very simple problem, but I can’t find an efficient way to solve it with CUDA.

For a small matrix, say 100x100, I want to compute the sum of all its elements. In my test it's actually just a vector containing the values 0 to 9999.

My first idea was to give each block one row: some threads would accumulate the sum for that row in shared memory, and then a single thread would sum the shared-memory vector … but I don't think that would really be faster than doing the same thing on the CPU …

Another idea is to do a reduction up to a given point: each thread adds 2 elements, halving the count at every step, until the number of elements is no longer even. But in some cases (an odd count right away) that would do almost nothing, so …

I look forward to seeing your solutions!

Thank you in advance.

PS: actually I want to do this for a sub-matrix of a bigger matrix, but I think the problem remains the same.

EDIT: I tried the following, but it's slower :(

```
__global__ void mean_calculation_kernel(float* d_Data, int data_size, float* blocktmp)
{
    const int threadsPerBlock = 512;
    __shared__ float cache[threadsPerBlock];

    int offset = threadIdx.x + blockIdx.x * blockDim.x;

    // Load one element per thread; pad out-of-range threads with 0 so the
    // reduction below never reads uninitialized shared memory.
    cache[threadIdx.x] = (offset < data_size) ? d_Data[offset] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    // Note: __syncthreads() must be reached by ALL threads of the block, so
    // it cannot sit inside the if(offset < data_size) branch.
    for (int i = blockDim.x / 2; i != 0; i /= 2)
    {
        if (threadIdx.x < i)
        {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (threadIdx.x == 0)
    {
        blocktmp[blockIdx.x] = cache[0];
    }
}
```

with

```
float mean_calculation(float* d_Data, int data_size)
{
    int T = 512;                           // threads per block (power of two)
    const int B = (data_size + T - 1) / T; // number of blocks

    float mean = 0.0f;
    float *h_mean = (float *)malloc(B * sizeof(float));
    float *blocktmp;
    cutilSafeCall(cudaMalloc((void**)&blocktmp, B * sizeof(float)));

    mean_calculation_kernel<<<B, T>>>(d_Data, data_size, blocktmp);

    // Copy all B per-block partial sums back to the host.
    cutilSafeCall(cudaMemcpy(h_mean, blocktmp, B * sizeof(float), cudaMemcpyDeviceToHost));

    // Final reduction of the partial sums on the CPU.
    for (int i = 0; i < B; i++)
    {
        mean += h_mean[i];
    }

    cutilSafeCall(cudaFree(blocktmp));
    free(h_mean);
    return mean / data_size;
}
```