What is the performance difference between the atomic arithmetic functions like atomicMin() and an if…then test? Has anyone compared them? Why are the atomic functions so fast? Do they offer the same performance in global memory? (I ask because in the reference appendix they are defined for both memory spaces…)
Atomic functions are quite slow in global memory, although how slow depends on how many threads hit the same address at once. I would avoid atomic functions wherever possible. Conflicting updates to one address are serialized, so you can lose any performance gain you may have had.
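To make the contended case concrete, here is a minimal sketch in which every thread calls atomicMin() on a single global address. The kernel name minGlobalAtomic and the host helper minWithGlobalAtomic are made up for illustration, error checking is left out, and n is assumed to be greater than zero. Note that a plain `if (data[i] < *result) *result = data[i];` is not an equivalent replacement: it is a non-atomic read-modify-write, so concurrent threads can overwrite each other's result.

```cuda
#include <climits>
#include <cuda_runtime.h>

// Every thread issues one atomicMin on the same global address, so
// conflicting updates to *result are applied one after another.
__global__ void minGlobalAtomic(const int* data, int n, int* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicMin(result, data[i]);
    }
}

// Hypothetical host helper (not from the original post): returns the
// minimum of h_data[0..n-1]. Assumes n > 0; no error checking.
int minWithGlobalAtomic(const int* h_data, int n)
{
    int* d_data = nullptr;
    int* d_result = nullptr;
    int h_result = INT_MAX;  // identity element for min

    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    minGlobalAtomic<<<blocks, threads>>>(d_data, n, d_result);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Timing this against the alternatives on your own card is the only reliable comparison, since the cost depends on the hardware and on how many threads collide.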
I haven’t used atomic functions in shared memory myself, but they should be faster in any case.
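A rough sketch of what the shared-memory variant could look like, assuming a device that supports atomics on shared memory: each block keeps its own minimum in shared memory, and only one thread per block touches the global address. The names minSharedAtomic and minWithSharedAtomic are invented for this example, error checking is omitted, and n is assumed to be greater than zero.

```cuda
#include <climits>
#include <cuda_runtime.h>

__global__ void minSharedAtomic(const int* data, int n, int* result)
{
    __shared__ int blockMin;  // one running minimum per block

    if (threadIdx.x == 0) {
        blockMin = INT_MAX;   // identity element for min
    }
    __syncthreads();

    // Threads of this block only contend with each other, in shared memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicMin(&blockMin, data[i]);
    }
    __syncthreads();

    // A single global atomic per block instead of one per thread.
    if (threadIdx.x == 0) {
        atomicMin(result, blockMin);
    }
}

// Hypothetical host helper (not from the original post): returns the
// minimum of h_data[0..n-1]. Assumes n > 0; no error checking.
int minWithSharedAtomic(const int* h_data, int n)
{
    int* d_data = nullptr;
    int* d_result = nullptr;
    int h_result = INT_MAX;

    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    minSharedAtomic<<<blocks, threads>>>(d_data, n, d_result);

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_result);
    return h_result;
}
```

Whether this is actually faster on a given card would have to be measured; the sketch only shows how the number of global atomics drops from one per thread to one per block.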
Maybe it’s faster to split your kernel into multiple smaller kernels and do a min reduction over the values produced by the threads.
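One way the reduction idea might look, as a sketch only: a kernel that reduces each block to one minimum in shared memory without any atomics, launched repeatedly until a single value is left. The names blockMinReduce, minWithReduction and THREADS are assumptions for this example. The block size must be a power of two, n is assumed to be greater than zero, and error checking is left out.

```cuda
#include <climits>
#include <cuda_runtime.h>

#define THREADS 256  // block size; must be a power of two for the loop below

// Reduces each block's slice of `in` to one minimum, written to out[blockIdx.x].
__global__ void blockMinReduce(const int* in, int n, int* out)
{
    __shared__ int s[THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : INT_MAX;  // pad with the identity
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s[threadIdx.x] = min(s[threadIdx.x], s[threadIdx.x + stride]);
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = s[0];
    }
}

// Hypothetical host helper (not from the original post): returns the
// minimum of h_data[0..n-1]. Assumes n > 0; no error checking.
int minWithReduction(const int* h_data, int n)
{
    int* d_in = nullptr;
    int* d_out = nullptr;
    int count = n;
    int blocks = (count + THREADS - 1) / THREADS;

    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, h_data, n * sizeof(int), cudaMemcpyHostToDevice);

    // Launch one kernel per pass until a single partial minimum remains.
    while (true) {
        blockMinReduce<<<blocks, THREADS>>>(d_in, count, d_out);
        if (blocks == 1) {
            break;
        }
        count = blocks;
        blocks = (count + THREADS - 1) / THREADS;
        int* tmp = d_in;  // partial results become the next pass's input
        d_in = d_out;
        d_out = tmp;
    }

    int h_result = INT_MAX;
    cudaMemcpy(&h_result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_result;
}
```

The input buffer is overwritten by later passes, which is fine here because the helper owns its device copy.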