 # Interpretation of Kernel

Hello all :)

I have the following kernel, which I found on the NVIDIA site, and I just want to check that I understand it. It should compute the sum of all values of the given vector v.

```c
#define N (1024)

__global__ void fastSum(float *v) {
    __shared__ float *sum;
    sum = v;

    int bdx = blockDim.x;
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int t = tx + bx * bdx;

    for (int stride = 1; stride < bdx; stride *= 2) {
        if (t % (2 * stride) == 0) {
            sum[t] += sum[t + stride];
        }
    }

    __syncthreads();

    if (tx == 0 && bx != 0) {
        v[0] += sum[bx * bdx];
    }
}
```

I know it is not the best algorithm to compute the sum of the values, but that's not my point here.

My thoughts were:

To handle more than one block, I added the __syncthreads at the end to wait until every thread of a block is finished. At that point, every thread of a block should have completed its computations, and then I tell the thread with index 0 of that block to add the computed result to the final result. Because the values in my vector are all 1.0, the output for N = 256 should be 256, for N = 1024 it should be 1024, and so on. However, with this version it is sometimes the right value, but sometimes not. For example, with N = 768, the first run gives 768, but the next run gives 384 or some other number. Where is the problem in my code?

Thank you for helping me!

Regards,

A.

Does no one have an idea?

Regards,
A.

You have a classic memory race on v at the end, which means the kernel can never reliably produce the correct answer when multiple blocks are running. Your use of shared memory is somewhat unusual too. Shared memory only has the lifespan of a single block, so addressing it using the block index is, at best, only going to waste a lot of shared memory. At worst you won't be able to launch the kernel because of a lack of resources, or your kernel will happily write off the end of the allocated shared memory block and crash.
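For reference, here is a minimal sketch of how both problems are usually addressed (my own naming and parameters, not the original NVIDIA code): each block stages its slice of v in a real shared-memory array, synchronizes between reduction steps, and one thread per block resolves the cross-block race with an atomicAdd into a separate, zero-initialized output. It assumes a launch with 256 threads per block and a float-capable atomicAdd (compute capability 2.0+):

```cuda
#define N 1024
#define BLOCK_SIZE 256

__global__ void blockSum(const float *v, float *result) {
    __shared__ float sum[BLOCK_SIZE];     // per-block scratch, one slot per thread

    int tx = threadIdx.x;
    int t  = tx + blockIdx.x * blockDim.x;

    // stage this block's slice of v in shared memory
    sum[tx] = (t < N) ? v[t] : 0.0f;
    __syncthreads();

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tx < stride)
            sum[tx] += sum[tx + stride];
        __syncthreads();                  // partial sums ready before the next step
    }

    // one atomic per block instead of a racy read-modify-write on v
    if (tx == 0)
        atomicAdd(result, sum[0]);
}
```

You would launch it as something like `blockSum<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_v, d_result);` after setting `d_result` to zero with cudaMemset, so every block (including block 0) contributes exactly once.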