Interpretation of Kernel

Hello all :)

I found the following kernel on the NVIDIA site, and I just want to check whether I understood it correctly. It should calculate the sum of all values of the given vector v.

#define N (1024)

__global__ void fastSum(float *v) {

	__shared__ float *sum;

	sum = v;

	int tx = threadIdx.x;
	int bdx = blockDim.x;
	int bx = blockIdx.x;

	int t = tx + bx * bdx;

	for(int stride = 1; stride < bdx; stride *= 2) {
		__syncthreads();

		if(t % (2*stride) == 0) {
			sum[t] += sum[t+stride];
		}
	}
	__syncthreads();

	if(tx == 0 && bx != 0) {
		v[0] += sum[bx*bdx];
	}
}

I know it is not the best algorithm to compute the sum of the values, but that's not my point here.

My thoughts were:

To walk through more than one block, I have to add the __syncthreads at the end to wait until every thread of a block is finished. At that point, every thread of the block should have done its computations, and then I tell the thread with index 0 of that block to add the computed result to the final result. Because the values in my vector are all 1.0, the output for N = 256 should be 256, for N = 1024 it should be 1024, and so on. However, with this version the result is sometimes right and sometimes not. For example, for N = 768 the first run gives 768, but the next run gives 384 or some other number. Where is the problem in my code?

Thank you for helping me!

Regards,

A.

Does no one have an idea?

Regards,
A.

You have a classic memory race on v[0] at the end which will mean the kernel can never reliably produce the correct answer when multiple blocks are running. Your use of shared memory is somewhat unusual too. Shared memory only has the lifespan of a single block, so addressing using block index is, at best, only going to waste a lot of shared memory. At worst you won’t be able to launch the kernel because of a lack of resources, or your kernel will happily write off the end of the allocated shared memory block and crash.
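To spell the race out with the lines from your own kernel:

	if(tx == 0 && bx != 0) {
		// v[0] += x is a read-modify-write: load v[0], add, store back.
		// Thread 0 of two different blocks can both load the same old
		// value of v[0] before either one stores, so one block's
		// contribution is simply lost. __syncthreads() only synchronizes
		// threads within a single block; nothing orders blocks against
		// each other inside one kernel launch, which is why your result
		// changes from run to run.
		v[0] += sum[bx*bdx];
	}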

Thank you for your response.

OK, I understand. So it would be better to let each thread load one element of the vector into shared memory when it's needed, instead of putting the whole vector in shared memory?

How can the memory race on v[0] be solved? Or better: what is a good way to add up all the partial results at the end?

Regards,
A.

Yes, it really doesn't make much sense to do anything else. Apart from limiting yourself to a maximum array size of 4096 floats (there is only 16 KB of shared memory per running block, and a float takes 4 bytes), you are incurring an enormous global memory latency penalty to load data you don't use.
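As a minimal sketch of what that per-thread loading could look like (the names blockSum and blockSums are mine, and it assumes blockDim.x is a power of two and N is a multiple of the block size): each thread stages exactly one element into shared memory, the block reduces those values, and thread 0 writes a single partial sum back to global memory.

__global__ void blockSum(const float *v, float *blockSums) {
	extern __shared__ float s[];   // one float per thread, sized at launch time

	int tx = threadIdx.x;
	int t = blockIdx.x * blockDim.x + tx;

	s[tx] = v[t];      // each thread loads exactly one element
	__syncthreads();   // all loads must finish before anyone reduces

	// Tree reduction on block-local indices; the same pattern as your loop,
	// but entirely within this block's own shared memory.
	for(int stride = 1; stride < blockDim.x; stride *= 2) {
		if(tx % (2*stride) == 0) {
			s[tx] += s[tx + stride];
		}
		__syncthreads();
	}

	if(tx == 0) {
		blockSums[blockIdx.x] = s[0];   // one partial sum per block, no race
	}
}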

Allocate an output vector and have each block write its partial sum into it. You then only need to run a summation over that output vector (one number per block) to get the final global sum. You should also take a look at the reduction or newreduction examples in the SDK. They contain a highly optimized reduction that does exactly what you are trying to do, and there is a nice whitepaper to go along with it that explains most of the practicalities of implementing parallel reductions in CUDA.
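The host side of that two-pass scheme could look roughly like this (a sketch under the same assumptions and hypothetical names as above; error checking omitted). Since only one float per block is left over, summing the partials on the host is the simplest correct way to finish:

int threads = 256;
int blocks = N / threads;   // assumes N is a multiple of the block size

float h_v[N];   // host input; all 1.0f, as in your test
for(int i = 0; i < N; ++i) h_v[i] = 1.0f;

float *d_v, *d_blockSums;
cudaMalloc((void**)&d_v, N * sizeof(float));
cudaMalloc((void**)&d_blockSums, blocks * sizeof(float));
cudaMemcpy(d_v, h_v, N * sizeof(float), cudaMemcpyHostToDevice);

// The third launch parameter sizes the dynamic shared memory: one float per thread.
blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_v, d_blockSums);

float *h_partials = (float*)malloc(blocks * sizeof(float));
cudaMemcpy(h_partials, d_blockSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);

float total = 0.0f;
for(int i = 0; i < blocks; ++i) {
	total += h_partials[i];   // one cheap addition per block
}
// total should now be 1024.0 for N = 1024

free(h_partials);
cudaFree(d_v);
cudaFree(d_blockSums);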