Without atomic functions, is there some trick for an efficient read-modify-write on, e.g., a float?
For example, I have an array defined in global memory and tried to change its elements in my kernel like this:
testV[ix] = testV[ix] + newV;
I’ve also tested atomicAdd() for int, but it doesn’t actually bring much improvement.
You could use a reduction to sum the values calculated per thread into a single float value for the current block, write each block’s output
to a different position in the final output array, and then sum the per-block results either in another kernel or on the CPU.
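A minimal sketch of that idea, in case it helps. The names blockSum, sumOnGpu and kThreads are made up for illustration, the input array just stands in for whatever each thread actually computes, the block size of 256 is an assumption (it has to be a power of two for this loop), and error checking is left out:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces its slice of `in` to one partial sum
// and writes it to blockSums[blockIdx.x].
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float cache[kThreads];
    int tid = threadIdx.x;
    int ix  = blockIdx.x * blockDim.x + tid;

    // Stand-in for the per-thread result; out-of-range threads contribute 0.
    cache[tid] = (ix < n) ? in[ix] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    // One plain (non-atomic) write per block, each to its own slot.
    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];
}

// Hypothetical host helper: launches the kernel, then sums the
// per-block results on the CPU.
float sumOnGpu(const std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kThreads - 1) / kThreads;

    float* dIn = nullptr;
    float* dSums = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dSums, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dSums, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dSums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

No two threads ever write to the same global address here, so no atomics are needed. If there are many blocks, the final loop on the CPU could be replaced by a second launch of the same kernel over the partial sums.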
As for performance, it probably depends on how much work each thread has to do. If each thread calculates a lot and then you need
to sum the results, the scheme described above is probably the best solution (I’ve used something like this in my kernels).
I hope this is what you meant :)
Atomic functions don’t have any performance impact for compute capability <= 1.1
Did you try atomicExch()?
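For what it’s worth, here is a rough sketch of how a float add can be built from atomicExch() alone, since atomicExch() has a float overload. The names exchAddFloat, addOnes and countOnGpu are made up for illustration, and error checking is left out. Treat it as a sketch rather than a recommendation: the target briefly holds 0 while an update is in flight, so a plain read from another thread may see an intermediate value, and heavy contention on one address will still serialize. On devices of compute capability 2.0 and later, atomicAdd() accepts float directly.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: float add built only from atomicExch().
__device__ void exchAddFloat(float* addr, float val)
{
    float carry = val;
    // The inner exchange takes the current total out (leaving 0); the outer
    // exchange writes total + carry back. A non-zero return value means
    // another thread deposited something in between, so that value is
    // carried into another round instead of being lost.
    while ((carry = atomicExch(addr, atomicExch(addr, 0.0f) + carry)) != 0.0f) {
    }
}

// Every in-range thread adds 1.0f to the same global float.
__global__ void addOnes(float* total, int n)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix < n)
        exchAddFloat(total, 1.0f);
}

// Hypothetical host helper: returns the accumulated value for n threads.
float countOnGpu(int n)
{
    const int threads = 128;  // assumed block size
    int blocks = (n + threads - 1) / threads;

    float* dTotal = nullptr;
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemset(dTotal, 0, sizeof(float));  // all-zero bits is 0.0f

    addOnes<<<blocks, threads>>>(dTotal, n);

    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dTotal);
    return total;
}
```

The loop keeps the sum of the value in memory plus all pending carries constant, which is why nothing gets dropped when two threads collide.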