Interpretation of Kernel

alexnovice · August 5, 2009, 1:00pm

Hello all :)

I have got the following kernel I found on the NVIDIA site, and I just wanted to know if i did understand it. It should calculate the sum of all values of the given vector v.

#define N (1024)

__global__ void fastSum(float *v) {

	

	__shared__ float *sum;

	

	sum = v;

	

	int tx = threadIdx.x;

	int bdx = blockDim.x;

	int bx = blockIdx.x;

	

	int t = tx + bx * bdx;

	

	for(int stride = 1; stride < bdx; stride *= 2) {

		__syncthreads();

		

		if(t % (2*stride) == 0) {

			sum[t] += sum[t+stride];

		}

	}

	__syncthreads();

	

	if(tx == 0 && bx != 0) {

		v[0] += sum[bx*bdx];

	}

}

I know it is not the the best algorithm to compute the sum of the values but thats not my point here.

My thoughts were:

To walk through more blocks than one, I have to add the __syncthreads at the end to wait until each thread of one block is finished. At this point, every thread of a block should have done his computations and then I am telling the thread with the index 0 of this block to add the computed result to the final result. Because the values in my vector are all 1.0, the ouput for N = 256 should be 256, for N = 1024 should be 1024 and so on … however, with this version now it is sometimes the right value, but sometimes not. For example, N = 768 … first time run result is 768 … next time run result is 384 or some other numbers. Where is the problem to be found in my code?

Thank you for helping me!

Regards,

A.

alexnovice · August 11, 2009, 12:38pm

Does noone have an idea?

Regards,
A.

avidday · August 11, 2009, 1:46pm

You have a classic memory race on v[0] at the end which will mean the kernel can never reliably produce the correct answer when multiple blocks are running. Your use of shared memory is somewhat unusual too. Shared memory only has the lifespan of a single block, so addressing using block index is, at best, only going to waste a lot of shared memory. At worst you won’t be able to launch the kernel because of a lack of resources, or your kernel will happily write off the end of the allocated shared memory block and crash.

alexnovice · August 11, 2009, 2:09pm

Thank you for your response.

Ok I understand. So it would be better to let each thread load one element of the vector in the shared memory when its needed instead of putting the whole vector in shared memory?

How can the memory race on v[0] be solved, or better what is a good way to add up all partial results at the end?

Regards,
A.

avidday · August 11, 2009, 2:56pm

It really doesn’t make much sense to do anything else. Apart from limiting yourself to a maximum array size of 4096 floats (there is only a total of 16k of shared memory per running block), you are incurring an enormous global memory latency penalty to load data you don’t use.

Allocate an output vector and have each block write its sum into that. You then only need to run a summation through the output vector (one number per block) to get the final global sum. You probably should take a look at the reduction or newreduction examples in the SDK. They contain a very optimal reduction case which does what you are trying to do, and there is a nice whitepaper to go along with it that explains most of the practicalities of implementing parallel reductions in CUDA.

Topic		Replies	Views
computing a sum leads to infinite values CUDA Programming and Performance	3	5375	September 16, 2008
Accumulate value within block CUDA Programming and Performance	15	3063	October 16, 2010
do not understand thread/block division CUDA Programming and Performance	10	2791	April 23, 2012
Newbie question about shared memory CUDA Programming and Performance	0	4740	October 16, 2009
HELP with vector sum CUDA Programming and Performance	6	2192	May 11, 2010
Optimizing calculation question Euclidean distance between arrays CUDA Programming and Performance	5	5329	March 1, 2010
I want to calculate the sum of the 512 lines CUDA Programming and Performance	16	2012	January 4, 2013
Slow Performance CUDA Programming and Performance	19	8122	November 24, 2008
Unexpected Behaviour 2+ block thread communication issues CUDA Programming and Performance	0	2275	February 23, 2010
Easyway to compute the sum of the array? CUDA Programming and Performance	4	8015	February 13, 2008

Interpretation of Kernel

Related topics