atomic read or write

I am working on a program that needs to perform an atomic floating-point add from each thread. I noticed there is no atomicAdd for floats, so I used the following:

atomicExch((int *)&globaldata[ishift], __float_as_int(globaldata[ishift]+newval));

where globaldata is a float array in global memory and newval is the float value to be added.

I ran this and it seems to work fine, but there is about a 35% performance drop compared with a non-atomic memory write (i.e. globaldata[ishift] += newval).

It looks to me like the above code reads global memory twice: once inside atomicExch and once when reading globaldata[ishift] for the addition. I don’t know whether this is responsible for the speed drop (I know atomic calls are also expensive).

I have two questions:

  1. Does the above statement make sense as an atomic float add?

  2. Is there an atomic function that performs just a single global-memory write? If there is, I can perhaps save the memory read associated with atomicExch.

thanks

This still has a race condition. The expression for the second argument is not evaluated atomically, so after the read in the sum but before the atomic exchange is performed, another thread can go through the same sequence: both threads read the same old value, both exchange in old + newval, and one of the additions is lost.

This post has an atomic float add that works, but is very slow:

http://forums.nvidia.com/index.php?s=&showtopic=67691&view=findpost&p=380935
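For reference, the workaround there is (if I remember right) a compare-and-swap loop along these lines. This is just a sketch, reusing the atomicFloatAdd name from the linked post:

__device__ float atomicFloatAdd(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // swap in the new sum only if the value has not changed meanwhile
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}

Each failed CAS is another round trip to global memory, which is why it gets slow under heavy contention.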

I see the problem.

I actually tried the atomicFloatAdd earlier today; it only worked when I used a small number of threads. When I used more threads, I kept getting a “the launch timed out and was terminated” error, as I posted at http://forums.nvidia.com/index.php?showtopic=101970

It would be nice if programs could run as long as they want, so long as they stay within the hardware’s resource limits.

FangQ,

There was a big thread on implementing spinlocks on the GPU…

The agreed approach is to first let one thread from each block contend for the global lock, and then have all the threads inside that block lock on a shared-memory lock (compute 1.2) to obtain exclusivity… A sketch of the block-leader part follows below.

If all threads contend for the global lock, you will get a launch timeout… It takes a LOT of time.
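A minimal sketch of the block-leader idea, assuming a hypothetical global lock word gmutex that starts at 0 (the shared-memory stage inside the block is omitted for brevity):

__device__ int gmutex = 0;   // hypothetical global lock word, 0 = free

__global__ void blockLeaderLock(float *globaldata, float newval)
{
    // only one thread per block spins on the global lock, so at most
    // gridDim.x threads contend instead of gridDim.x * blockDim.x
    if (threadIdx.x == 0) {
        while (atomicCAS(&gmutex, 0, 1) != 0)
            ;                            // spin until the lock is free
    }
    __syncthreads();                     // the whole block now owns the lock

    if (threadIdx.x == 0)
        globaldata[0] += newval;         // critical section, block-exclusive

    __syncthreads();
    if (threadIdx.x == 0)
        atomicExch(&gmutex, 0);          // release the global lock
}

The point is to keep the number of threads spinning on the global lock small; the shared-memory lock then serializes the threads within a block without touching global memory at all.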