Hello all :)

I found the following kernel on the NVIDIA site and want to check whether I have understood it correctly. It should compute the sum of all values in the given vector `v`.

```
#define N (1024)

__global__ void fastSum(float *v) {
    __shared__ float *sum;
    sum = v;
    int tx = threadIdx.x;
    int bdx = blockDim.x;
    int bx = blockIdx.x;
    int t = tx + bx * bdx;
    for (int stride = 1; stride < bdx; stride *= 2) {
        __syncthreads();
        if (t % (2 * stride) == 0) {
            sum[t] += sum[t + stride];
        }
    }
    __syncthreads();
    if (tx == 0 && bx != 0) {
        v[0] += sum[bx * bdx];
    }
}
```

I know it is not the best algorithm for computing the sum, but that's not my point here.

My thoughts were:

To handle more than one block, I added the `__syncthreads()` at the end to wait until every thread of a block has finished. At that point, every thread of the block should have done its computations, and I then tell the thread with index 0 of that block to add the block's result to the final result in `v[0]`.

Because the values in my vector are all 1.0, the output for N = 256 should be 256, for N = 1024 it should be 1024, and so on. With this version, however, the result is sometimes right and sometimes not. For example, with N = 768 the first run gives 768, but the next run gives 384 or some other number. Where is the problem in my code?
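To make clearer what behaviour I am after, here is a minimal sketch of a variant that I would expect to give the same result on every run. It is not from the NVIDIA site. The names `blockSum`, `sumOnDevice` and `BLOCK_SIZE`, the separate `result` buffer and the block size of 256 are all my own assumptions. The sketch rests on my suspicion that `__syncthreads()` only synchronizes the threads within one block, so the final `v[0] += ...` of different blocks can race with each other and with block 0, which may still be reducing into `v[0]`. To sidestep that, each block reduces into its own shared-memory array and then adds its partial sum to a separate output with `atomicAdd`.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size (not from the original kernel); must be a power of two.
#define BLOCK_SIZE 256

// Each block reduces its slice of v in shared memory, then thread 0 adds the
// block's partial sum to *result atomically. Must be launched with
// blockDim.x == BLOCK_SIZE.
__global__ void blockSum(const float *v, float *result, int n) {
    __shared__ float partial[BLOCK_SIZE];
    int tx = threadIdx.x;
    int t = tx + blockIdx.x * blockDim.x;

    // Threads past the end of the vector contribute 0.
    partial[tx] = (t < n) ? v[t] : 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tx % (2 * stride) == 0) {
            partial[tx] += partial[tx + stride];
        }
        // Reached by every thread of the block, outside the branch.
        __syncthreads();
    }

    // One atomic add per block, so blocks cannot overwrite each other.
    if (tx == 0) {
        atomicAdd(result, partial[0]);
    }
}

// Hypothetical host helper: copies h to the device, runs blockSum and
// returns the total. CUDA error checking is omitted to keep the sketch short.
float sumOnDevice(const std::vector<float> &h) {
    assert(!h.empty());
    int n = static_cast<int>(h.size());
    float *dV = nullptr;
    float *dResult = nullptr;
    cudaMalloc(&dV, n * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dV, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, sizeof(float));  // accumulator starts at 0

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    blockSum<<<blocks, BLOCK_SIZE>>>(dV, dResult, n);

    float total = 0.0f;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&total, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dV);
    cudaFree(dResult);
    return total;
}
```

With this I would call something like `sumOnDevice(std::vector<float>(768, 1.0f))` and expect 768 on every run. Note that the input vector is left unmodified here, unlike in the kernel above, and that for general float data the order of the atomic additions can still change the rounding slightly between runs.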

Thank you for helping me!

Regards,

A.