I am trying to accumulate results from every thread in a device variable. When I run the program, the variable only holds the result from the last thread to touch it.
For example:
__device__ float result = 0;

__global__ void gpuCode()
{
    result += threadIdx.x;
}

void host()
{
    float result_h;
    gpuCode<<<1, N>>>();
    cudaMemcpyFromSymbol((void*)&result_h, result, sizeof(float));
    printf("result: %f", result_h);
}
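You can’t do that. Every thread does an unsynchronized read-modify-write on the same variable, so updates from other threads get overwritten. To accumulate like this you need either a reduction or atomic instructions, and the atomic instructions only work on integers.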
It works just like on a CPU: there you would need a critical section or InterlockedIncrement to avoid race conditions and to synchronize the writes. Since CUDA does not have critical sections, use atomicAdd on compute capability 1.1 or higher hardware, or write a reduction kernel.
And atomic operations won’t work with floats on compute 1.x hardware (atomicAdd on float needs compute capability 2.0 or higher).
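For what it's worth, here is a minimal sketch of the atomic route, assuming the accumulator can be an int instead of a float. The value N = 256 and the main() wrapper are my own assumptions (the original post leaves N undefined); everything else follows the posted code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // assumed block size; the original post does not define N

// int rather than float, so the integer atomicAdd can be used
__device__ int result = 0;

__global__ void gpuCode()
{
    // atomicAdd does the read-modify-write as one indivisible operation,
    // so no thread's contribution is overwritten by another thread.
    atomicAdd(&result, (int)threadIdx.x);
}

int main()
{
    int result_h = 0;
    gpuCode<<<1, N>>>();

    // Blocking copy: waits for the kernel to finish before reading the symbol.
    cudaError_t err = cudaMemcpyFromSymbol(&result_h, result, sizeof(int));
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Sum of 0..N-1 is N*(N-1)/2, i.e. 32640 for N = 256.
    printf("result: %d\n", result_h);
    return 0;
}
```

Keep in mind that all the threads serialize on that one address, so this is fine for a handful of updates but will probably be slower than a reduction when every thread contributes.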
Look at the SDK example project “Reduction”. It’s exactly what you want.
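If you want to keep the float, a rough single-block sketch of the reduction idea could look like the following. This is not the SDK sample's code, just a simplified illustration; it assumes one block whose size N is a power of two (N = 256 and main() are my assumptions), and the shared array name partial is made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256  // assumed block size; must be a power of two for this sketch

__device__ float result = 0.0f;

__global__ void gpuCode()
{
    __shared__ float partial[N];      // one slot per thread in the block
    unsigned int tid = threadIdx.x;

    partial[tid] = (float)tid;        // each thread stores its own contribution
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();              // all threads reach this barrier
    }

    // Only one thread writes the final value, so there is no race.
    if (tid == 0)
        result = partial[0];
}

int main()
{
    float result_h = 0.0f;
    gpuCode<<<1, N>>>();

    cudaError_t err = cudaMemcpyFromSymbol(&result_h, result, sizeof(float));
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Sum of 0..N-1 is N*(N-1)/2, i.e. 32640 for N = 256.
    printf("result: %f\n", result_h);
    return 0;
}
```

With more than one block you would have each block write its partial sum to a global array and then reduce that array in a second pass, which is roughly what the SDK sample does with more optimizations.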