I have a few questions related to atomic operations on global memory:
- How many atomic units are present, and how many atomic operations can complete per cycle (throughput)?
- What is the latency of atomic operations on global memory?
- How are atomic units implemented in current-generation hardware?
Could someone point me to documentation or other sources that answer these questions?