Is there a way to update three consecutive numbers x, y, z at once that is faster than three separate atomicAdd operations?
Can the three updates be combined somehow?
A software mutex in CUDA is almost certainly going to be far slower than calling three atomic operations. Atomics got much faster on Kepler (and were already pretty fast on Fermi), so I would just use them directly.
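For illustration, here is a minimal sketch of the "just use three atomics" approach. The kernel name `accumulate_xyz`, the host helper `run_accumulate_xyz`, and the per-thread increments 1, 2, 3 are all made up for the example; the only assumption about your data is that x, y, z are three adjacent `int` values in global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds its contribution to three adjacent counters.
// The three atomicAdd calls are independent: each counter is updated
// atomically, but the triple is not updated as a single unit.
__global__ void accumulate_xyz(int *sums, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&sums[0], 1);  // x
        atomicAdd(&sums[1], 2);  // y
        atomicAdd(&sums[2], 3);  // z
    }
}

// Hypothetical host helper: runs the kernel with n threads (n > 0)
// and copies the three sums back into out. Returns true on success.
bool run_accumulate_xyz(int n, int out[3])
{
    int *d_sums = nullptr;
    if (cudaMalloc(&d_sums, 3 * sizeof(int)) != cudaSuccess) return false;
    cudaMemset(d_sums, 0, 3 * sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    accumulate_xyz<<<blocks, threads>>>(d_sums, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(out, d_sums, 3 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_sums);
    return err == cudaSuccess;
}
```

Calling `run_accumulate_xyz(1000, out)` from `main` should leave `out` holding the per-counter totals. Note that another thread can observe x already updated while y and z are not; if you only need correct final sums, that does not matter.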