Hi,
we need to build the sum of float data in a grid, but CUDA has no atomicAdd(float*, float). I have searched this forum but found only one, very slow, solution.
I know that (A+B)+C != A+(B+C) when all the data are floats, so the order of the additions can change the result slightly.
So I built this for myself, but I want to share the code with everybody to help solve related problems.
This implementation can be adapted to mul, sub, div, … (a sketch for multiplication is at the end of this post).
I have run a little test to check the speed of this method: I sum all numbers from 1.0 to 256000.0, then I do the same with atomicAdd(int*, int) and compare the runtimes.
FORUM = implementation in this forum
MY    = this implementation (see bottom)
INT   = implementation with atomicAdd(int*, int)

  MY   |  FORUM   |  INT
 50 ms | 34862 ms | 8 ms
My atomicAdd() implementation is about 6 times slower than atomicAdd(int).
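For reference, the INT comparison just uses the built-in integer atomic; such a kernel can look like this (the name myAddInt is only an example, not necessarily the exact kernel I timed):

__global__ void myAddInt(int* out, int* a)
{
    // Built-in hardware atomic, used only as the speed reference.
    atomicAdd(out, a[blockIdx.x * blockDim.x + threadIdx.x]);
}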
I call the kernel myAdd with these parameters: “out” is a pointer to a single float value and “in” is an array with 256000 values (1 to 255999).
myAdd<<<100,256>>>(out,in);
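For completeness, here is a minimal host-side sketch of how such a test can be set up (this is not my exact benchmark code: the names d_in/d_out and N are only examples, the grid size is derived from N instead of being hard-coded, and it assumes the myAdd kernel and the float atomicAdd below are defined above main in the same .cu file):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int N = 256000;                // number of values to sum (assumption)
    const int threads = 256;

    std::vector<float> h_in(N);
    for (int i = 0; i < N; ++i)
        h_in[i] = float(i + 1);          // 1.0 .. 256000.0

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float)); // the accumulator must start at 0.0f

    myAdd<<<N / threads, threads>>>(d_out, d_in);
    cudaDeviceSynchronize();

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);         // exact sum is 32768128000; expect float rounding error

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}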
An implementation that returns the old value:
__device__ inline float atomicAdd(float* address, float value)
{
    float old = value;
    // Grab the current accumulator and leave 0.0f behind as an "empty" marker.
    float ret = atomicExch(address, 0.0f);
    float new_old = ret + old;
    // Write our partial sum back. If the exchange returns a non-zero value,
    // another thread deposited its partial sum while we held the marker, so
    // take the accumulator out again and fold that value in too.
    while ((old = atomicExch(address, new_old)) != 0.0f)
    {
        new_old = atomicExch(address, 0.0f);
        new_old += old;
    }
    return ret;
}
__global__ void myAdd(float* out, float* a)
{
    atomicAdd(out, a[blockIdx.x * blockDim.x + threadIdx.x]);
}
An implementation without a return value (needs fewer registers):
__device__ inline void atomicAdd(float* address, float value)
{
    float old = value;
    float new_old;
    do
    {
        // Take the accumulator out (leaving the 0.0f marker) and add our
        // pending value to it.
        new_old = atomicExch(address, 0.0f);
        new_old += old;
    }
    // Write the sum back; repeat if another thread deposited a value meanwhile.
    while ((old = atomicExch(address, new_old)) != 0.0f);
}
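And here is the sketch for multiplication mentioned above (untested, only to show how the trick can be adapted): the "empty" marker must be the multiplicative identity 1.0f instead of 0.0f, and the accumulator must be initialized to 1.0f before the kernel runs. A partial product that happens to be exactly 1.0f is then treated like the empty marker, which is harmless because it contributes nothing to the product.

__device__ inline void atomicMul(float* address, float value)
{
    float old = value;
    float new_old;
    do
    {
        // Take the accumulator out (leaving the 1.0f marker) and multiply our
        // pending value into it.
        new_old = atomicExch(address, 1.0f);
        new_old *= old;
    }
    // Write the product back; repeat if another thread deposited a value meanwhile.
    while ((old = atomicExch(address, new_old)) != 1.0f);
}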