I’m trying to write my own version of atomicAdd for floats in global memory.
/* Off topic: it’s a bit odd that this intrinsic isn’t available out of the box, but anyway. */
In the code below, every thread on the card tries to add 1.0f to the same global memory location, so the result should essentially be the number of threads launched:
__device__ void myAtomicAdd(float* addr, float data)
{
    float a = *addr;
    float b = 0;
    do {
        b = a;
        a = b + data;
        a = atomicExch(addr, a);
    } while (a != b);
}

__global__ void testKernel(float* g_odata)
{
    myAtomicAdd(g_odata, 1.0f);
}

void run()
{
    ....
    /* initialize kernel h_odata with zero */
    .....
    testKernel<<<numBlocks, numThreads>>>(d_odata);
    cudaMemcpy(h_odata, d_odata, memSize, cudaMemcpyDeviceToHost);
    printf("result: %f expected: %d \n", *h_odata, numBlocks * numThreads);
    ....
}

It works correctly for numBlocks=1 and numThreads<512.
With 512 threads it hangs completely.
For numBlocks>1 it gives incorrect results (lower than expected).
I’m using a GTX 280 card with compute capability 1.3.
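
For comparison, here is a minimal sketch of the compare-and-swap pattern that is often used for this kind of read-modify-write. It differs from the atomicExch loop above in that a new value is stored only if the location still holds the value the sum was computed from. It assumes 32-bit atomicCAS on global memory is available on the target device. The names atomicAddFloatCAS and countThreads and the launch sizes are hypothetical, chosen only for illustration, and this is not presented as a drop-in fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): float add built on atomicCAS.
__device__ float atomicAddFloatCAS(float* addr, float val)
{
    int* word = (int*)addr;   // treat the float's 32 bits as an int
    int old = *word;
    int assumed;
    do {
        assumed = old;
        float sum = __int_as_float(assumed) + val;
        // Store sum only if *word still equals 'assumed';
        // atomicCAS returns the value it actually found there.
        old = atomicCAS(word, assumed, __float_as_int(sum));
    } while (old != assumed);  // compare bit patterns, not float values
    return __int_as_float(old);
}

// Every thread adds 1.0f to the same location.
__global__ void countThreads(float* counter)
{
    atomicAddFloatCAS(counter, 1.0f);
}

int main()
{
    const int numBlocks = 4;     // assumed sizes for the sketch
    const int numThreads = 256;
    float h_count = 0.0f;
    float* d_count = NULL;

    cudaMalloc((void**)&d_count, sizeof(float));
    cudaMemcpy(d_count, &h_count, sizeof(float), cudaMemcpyHostToDevice);

    countThreads<<<numBlocks, numThreads>>>(d_count);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_count, d_count, sizeof(float), cudaMemcpyDeviceToHost);
    printf("result: %f expected: %d\n", h_count, numBlocks * numThreads);

    cudaFree(d_count);
    return 0;
}
```

Comparing the integer bit patterns rather than the floats keeps the loop condition well defined even if the stored value were ever NaN. Note also that a float counter incremented by 1.0f is exact only up to 2^24.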