Increment Global __device__ Issue

I am trying to accumulate results from every thread into a device variable. When I run the program, the variable only contains the result of the last thread that touched it.

For example:

```
__device__ float result = 0;

__global__ void gpuCode()
{
    result += threadIdx.x;
}

void host()
{
    float result_h;

    gpuCode<<<1, N>>>();

    cudaMemcpyFromSymbol((void*)&result_h, result, sizeof(float));
    printf("result: %f", result_h);
}
```

You can’t do that: every thread performs an unsynchronized read-modify-write on `result`, so the updates race and all but one are lost. To accumulate across threads you need either a reduction or atomic instructions (which on that hardware generation exist only for integers).

It’s the same problem you’d hit on a CPU: you’d need a critical section or `InterlockedIncrement` to avoid the race and serialize the writes. Since CUDA has no critical sections, use `atomicAdd()` on compute capability 1.1 or later hardware, or write a reduction kernel.
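A minimal sketch of the `atomicAdd()` approach, using an `int` accumulator since integer atomics are what compute 1.1 provides (the `N = 256` launch size is just an example):

```
#include <cstdio>

__device__ int result = 0;   // integer accumulator: int atomicAdd needs compute 1.1+

__global__ void gpuCode()
{
    // atomicAdd serializes each read-modify-write, so no updates are lost
    atomicAdd(&result, (int)threadIdx.x);
}

int main()
{
    const int N = 256;
    gpuCode<<<1, N>>>();

    int result_h = 0;
    cudaMemcpyFromSymbol(&result_h, result, sizeof(int));
    printf("result: %d\n", result_h);   // 0 + 1 + ... + 255 = 32640
    return 0;
}
```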

And atomic operations won’t work on floats on that hardware (float `atomicAdd()` requires compute capability 2.0 or later).
Look at the SDK sample project “reduction”. It’s exactly what you want.
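For floats, the idea behind that sample is a shared-memory tree reduction, where each step halves the number of active threads. This is a simplified single-block sketch of the technique, not the SDK sample itself, and it assumes the block size is a power of two:

```
#include <cstdio>

__device__ float result = 0.0f;

// One-block tree reduction over the thread indices.
// Assumes blockDim.x is a power of two.
__global__ void reduceKernel()
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    sdata[tid] = (float)tid;          // each thread contributes its value
    __syncthreads();

    // halve the number of active threads each iteration
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        result = sdata[0];            // only thread 0 writes the total
}

int main()
{
    const int N = 256;                // power of two
    reduceKernel<<<1, N, N * sizeof(float)>>>();

    float result_h = 0.0f;
    cudaMemcpyFromSymbol(&result_h, result, sizeof(float));
    printf("result: %f\n", result_h); // 32640.0, same sum as before
    return 0;
}
```

For multiple blocks, the SDK sample reduces each block to one partial sum and then reduces the partials in a second kernel launch.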