How would I go about locking down data using mutual exclusion in CUDA?
A CAS-style mutex can be implemented in CUDA using atomics on global device memory… but such locks are incredibly slow.
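For reference, here is a minimal sketch of what such a CAS-based lock could look like. The names g_lock, lock, unlock, increment_under_lock and run_mutex_demo are made up for illustration and are not part of the CUDA API. It assumes only one thread per block takes the lock, because threads of the same warp spinning on one lock can deadlock on hardware without independent thread scheduling.

```cuda
#include <cuda_runtime.h>

// Hypothetical global lock word: 0 = free, 1 = held.
__device__ int g_lock = 0;

__device__ void lock(int *mutex) {
    // atomicCAS returns the old value; we own the lock once we swap 0 -> 1.
    while (atomicCAS(mutex, 0, 1) != 0) { /* spin */ }
    __threadfence();  // keep critical-section accesses after the acquire
}

__device__ void unlock(int *mutex) {
    __threadfence();       // make critical-section writes visible first
    atomicExch(mutex, 0);  // release the lock
}

__global__ void increment_under_lock(volatile int *counter) {
    // Assumption: one contender per block, to avoid intra-warp spinning.
    if (threadIdx.x == 0) {
        lock(&g_lock);
        *counter = *counter + 1;  // plain update, protected by the mutex
        unlock(&g_lock);
    }
}

// Hypothetical host helper: launches the kernel and returns the final counter.
int run_mutex_demo(int blocks, int threads) {
    int *d_counter = nullptr;
    int h_counter = -1;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));
    increment_under_lock<<<blocks, threads>>>(d_counter);
    cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

Calling run_mutex_demo(64, 128) should return one increment per block. Every block serializes on the same global word, which is where the slowness comes from.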
What would be the best way for threads to access shared memory?
By declaring shared memory arrays with the __shared__ qualifier.
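A small sketch of such a declaration, assuming a fixed block size of 256 threads. The kernel name reverse_block and the host helper reverse_demo are hypothetical. Each block stages its elements in a shared array, synchronizes, and then reads them back in reverse order.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size for this sketch

__global__ void reverse_block(const int *in, int *out) {
    __shared__ int tile[BLOCK_SIZE];  // one copy per block, visible to all its threads
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK_SIZE;
    tile[t] = in[base + t];
    __syncthreads();  // wait until every thread has written its element
    out[base + t] = tile[BLOCK_SIZE - 1 - t];
}

// Hypothetical host helper: reverses one block of 0..BLOCK_SIZE-1 and returns out[0].
int reverse_demo() {
    int h_in[BLOCK_SIZE], h_out[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i) h_in[i] = i;
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    reverse_block<<<1, BLOCK_SIZE>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[0];
}
```

No atomics are needed here because each thread writes a distinct element and the barrier separates the writes from the reads.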
It depends on where the data resides: global memory or shared memory.
Shared memory: it is best to avoid atomic operations in shared memory. They are supported only on devices of compute capability 1.2 and higher, and they are very slow.
You could use a loop (unrolled and executed by every thread) in which each iteration lets exactly one thread do its work (cost: approximately 96 cycles per 32 threads).
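The loop idea could be sketched roughly as follows, using a small shared-memory histogram as the protected data. This assumes one warp per block (blockDim.x == 32) and a toolkit that provides __syncwarp() (CUDA 9 or later). The names histogram_serialized and serialized_demo are hypothetical.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define NUM_BINS 8

// Assumption: launched with exactly WARP_SIZE threads per block.
__global__ void histogram_serialized(const unsigned int *keys, unsigned int *block_hist) {
    __shared__ unsigned int bins[NUM_BINS];
    int lane = threadIdx.x;
    if (lane < NUM_BINS) bins[lane] = 0;
    __syncwarp();

    unsigned int key = keys[blockIdx.x * WARP_SIZE + lane] % NUM_BINS;

    // Each iteration gives exactly one thread its turn, so no atomics are needed.
    #pragma unroll
    for (int turn = 0; turn < WARP_SIZE; ++turn) {
        if (lane == turn) bins[key] += 1;
        __syncwarp();  // make the update visible before the next turn
    }

    if (lane < NUM_BINS) block_hist[blockIdx.x * NUM_BINS + lane] = bins[lane];
}

// Hypothetical host helper: keys 0..31 in one block, returns the count in bin 0.
unsigned int serialized_demo() {
    unsigned int h_keys[WARP_SIZE], h_hist[NUM_BINS];
    for (int i = 0; i < WARP_SIZE; ++i) h_keys[i] = i;
    unsigned int *d_keys = nullptr, *d_hist = nullptr;
    cudaMalloc(&d_keys, sizeof(h_keys));
    cudaMalloc(&d_hist, sizeof(h_hist));
    cudaMemcpy(d_keys, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    histogram_serialized<<<1, WARP_SIZE>>>(d_keys, d_hist);
    cudaMemcpy(h_hist, d_hist, sizeof(h_hist), cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    cudaFree(d_hist);
    return h_hist[0];
}
```

With keys 0 to 31 and 8 bins, each bin receives 4 entries. The serialization costs 32 iterations regardless of how many threads actually conflict, so it only pays off when conflicts are frequent.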
Global memory: you should avoid atomic operations, or at least limit them.
One trick is to organize the work hierarchically: combine results in shared memory first, then commit them to global memory all at once (whether that is 1 operation or 32 per warp), which limits the number of global memory atomic operations.
Another is to queue requests per warp (pre-optimized in shared memory) and have one warp process these requests in a round-robin manner.
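A minimal sketch of the first trick (shared memory first, then a single global atomic), assuming a block size of 256 and a simple "count the even values" task chosen purely for illustration. The names count_even and count_even_demo are hypothetical. Each block reduces its flags in shared memory without atomics, and only thread 0 issues one atomicAdd on global memory.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size (must be a power of two here)

__global__ void count_even(const int *data, int n, unsigned int *total) {
    __shared__ unsigned int partial[BLOCK_SIZE];
    int t = threadIdx.x;
    int i = blockIdx.x * BLOCK_SIZE + t;

    // Stage 1: each thread records its own result in shared memory.
    partial[t] = (i < n && data[i] % 2 == 0) ? 1u : 0u;
    __syncthreads();

    // Stage 2: tree reduction in shared memory, no atomics involved.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }

    // Stage 3: one global atomic per block instead of one per thread.
    if (t == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host helper: counts the even values among 0..n-1.
unsigned int count_even_demo(int n) {
    int *h_data = new int[n];
    for (int i = 0; i < n; ++i) h_data[i] = i;
    int *d_data = nullptr;
    unsigned int *d_total = nullptr;
    unsigned int h_total = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(unsigned int));
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    count_even<<<blocks, BLOCK_SIZE>>>(d_data, n, d_total);
    cudaMemcpy(&h_total, d_total, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_total);
    delete[] h_data;
    return h_total;
}
```

For n = 1000 this launches 4 blocks, so global memory sees 4 atomic operations rather than 1000.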
Anyway, atomic operations are really slow!