Hello.

I’m new to CUDA and still unsure about the best practices when programming with it.

I implemented a simple mean kernel, but I can’t help thinking the last part could be much more efficient.

```
// Mean kernel: each block writes its share of the mean to R[blockIdx.x]
__global__ void cudamean(float *X, float *R, int jump, int size) {
    int x = threadIdx.x, z = blockIdx.x;
    int i;
    int nt = blockDim.x;
    float u;
    // Dynamic shared memory: one float accumulator per thread.
    // D is a per-thread pointer, so no shared variable is written by every thread.
    extern __shared__ char Dmean[];
    float *D = (float *) Dmean;
    // Each thread sums the values it reaches by jumping "jump" positions
    D[x] = 0.0f;
    for (i = z*nt + x; i < size; i += jump) {
        D[x] += X[i];
    }
    __syncthreads();
    // The first thread sums the values of the other threads within the block
    if (x == 0) {
        for (i = 0, u = 0.0f; i < nt; i++)
            u += D[i];
        R[z] = u/size;
    }
}
```

In short: the vector is loaded into the graphics device memory, each thread sums the values it reaches by jumping “jump” positions at a time (jump = blockDim*nblocks), and the results of all threads within a block are then summed.

After each block has written its result to the vector R, the CPU sums the results from all thread blocks.
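For context, the host side described above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the exact code used here: the wrapper name `mean_on_device`, the block count of 16 and the constant test input are all hypothetical, and CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The kernel from the post, repeated (lightly condensed) so this sketch compiles on its own.
__global__ void cudamean(float *X, float *R, int jump, int size) {
    extern __shared__ char Dmean[];
    float *D = (float *) Dmean;
    int x = threadIdx.x, z = blockIdx.x, nt = blockDim.x;
    D[x] = 0.0f;
    for (int i = z*nt + x; i < size; i += jump)
        D[x] += X[i];
    __syncthreads();
    if (x == 0) {
        float u = 0.0f;
        for (int i = 0; i < nt; i++)
            u += D[i];
        R[z] = u/size;
    }
}

// Hypothetical host wrapper (name and signature are assumptions for illustration).
float mean_on_device(const float *h_X, int size, int nblocks, int nthreads) {
    float *d_X = NULL, *d_R = NULL;
    cudaMalloc((void **)&d_X, size * sizeof(float));
    cudaMalloc((void **)&d_R, nblocks * sizeof(float));
    cudaMemcpy(d_X, h_X, size * sizeof(float), cudaMemcpyHostToDevice);

    int jump = nthreads * nblocks;             // total number of threads in the grid
    size_t shmem = nthreads * sizeof(float);   // one shared accumulator per thread
    cudamean<<<nblocks, nthreads, shmem>>>(d_X, d_R, jump, size);

    // This copy waits for the kernel to finish before reading R.
    std::vector<float> h_R(nblocks);
    cudaMemcpy(h_R.data(), d_R, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_R);

    // Each R[z] was already divided by size, so the per-block results simply add up.
    float mean = 0.0f;
    for (int z = 0; z < nblocks; z++)
        mean += h_R[z];
    return mean;
}

int main() {
    const int size = 1 << 20;
    std::vector<float> h_X(size, 2.0f);   // constant input, so the mean should be close to 2
    float m = mean_on_device(h_X.data(), size, 16, 512);
    printf("mean = %f\n", m);
    assert(std::fabs(m - 2.0f) < 1e-4f);
    return 0;
}
```

The third launch parameter sets the size of the dynamic shared memory that the kernel sees as `Dmean`, which is why it has to match the number of threads per block.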

I ran this with 512 threads per block, since that is the maximum my device (a GTX 260) permits.

In the last part of the code, only one thread per block does the heavy work.
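One commonly described alternative for this step is a tree-style reduction in shared memory, where the number of active threads is halved at every step instead of leaving the whole sum to thread 0. The following is only a sketch of that technique, not the kernel above: the names `cudamean_tree` and `mean_tree_on_device` are hypothetical, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant: same strided sum, but the per-block sum is a tree reduction.
// Assumption: blockDim.x is a power of two (e.g. 512).
__global__ void cudamean_tree(const float *X, float *R, int jump, int size) {
    extern __shared__ float D[];
    int x = threadIdx.x, z = blockIdx.x, nt = blockDim.x;

    // Accumulate in a register, then store once to shared memory.
    float acc = 0.0f;
    for (int i = z*nt + x; i < size; i += jump)
        acc += X[i];
    D[x] = acc;
    __syncthreads();

    // Active threads per step: nt/2, nt/4, ..., 1.
    for (int s = nt/2; s > 0; s >>= 1) {
        if (x < s)
            D[x] += D[x + s];
        __syncthreads();   // outside the if, so every thread in the block reaches it
    }

    if (x == 0)
        R[z] = D[0]/size;
}

// Hypothetical host wrapper used only to exercise the kernel.
float mean_tree_on_device(const float *h_X, int size, int nblocks, int nthreads) {
    float *d_X = NULL, *d_R = NULL;
    cudaMalloc((void **)&d_X, size * sizeof(float));
    cudaMalloc((void **)&d_R, nblocks * sizeof(float));
    cudaMemcpy(d_X, h_X, size * sizeof(float), cudaMemcpyHostToDevice);

    cudamean_tree<<<nblocks, nthreads, nthreads * sizeof(float)>>>(
        d_X, d_R, nthreads * nblocks, size);

    std::vector<float> h_R(nblocks);
    cudaMemcpy(h_R.data(), d_R, nblocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_R);

    float mean = 0.0f;
    for (int z = 0; z < nblocks; z++)
        mean += h_R[z];
    return mean;
}

int main() {
    const int size = 1 << 20;
    std::vector<float> h_X(size, 2.0f);   // constant input, so the mean should be close to 2
    float m = mean_tree_on_device(h_X.data(), size, 16, 512);
    printf("mean = %f\n", m);
    assert(std::fabs(m - 2.0f) < 1e-4f);
    return 0;
}
```

With 512 threads per block, the reduction loop runs 9 steps rather than the 512 serial additions done by thread 0, at the cost of one `__syncthreads()` per step.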

Could anyone give me some ideas, please, either on the overall algorithm or on that last heavy part?

I’m posting here because I want to understand a lot more about this, since I’m aware that I’m only at the beginning.

Thank you.

(If possible, please move this post to the CUDA Programming and Development section; I didn’t realize I was posting in the wrong section. Thank you.)