So, I thought that it wasn’t possible to communicate between threads in different blocks,
but then I discovered atomic functions (particularly atomicCAS()), about which the docs just say:

“The operation is atomic in the sense that it is guaranteed to be performed without
interference from other threads.”

Which suggests that it operates across ALL threads irrespective of blocks. Is that correct?

Doesn’t that mean you can communicate between threads in different blocks?
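To make the question concrete, here is the kind of thing I have in mind (my own sketch, not from the docs; the names are mine):

```cuda
// Sketch: if atomics on global memory are device-wide, every thread in
// every block contends on the same counter, which is a (crude) form of
// inter-block communication.
__global__ void countAll(int *counter)
{
    // atomicAdd should serialise across ALL blocks, not just this one
    atomicAdd(counter, 1);
}

// Host side (error checking omitted):
//   int *d_counter;
//   cudaMalloc(&d_counter, sizeof(int));
//   cudaMemset(d_counter, 0, sizeof(int));
//   countAll<<<numBlocks, threadsPerBlock>>>(d_counter);
//   // copied back, the counter should equal numBlocks * threadsPerBlock
```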



Only if the execution of one block does not depend on the execution of another block. If there exists a scheduling where your kernel may deadlock, assume the hardware will do that.

Hello tmurray,

Would it be safe to say that atomicCAS would work to implement a correct lock that works with concurrent kernels.

I remember reading that a deadlock can occur if the number of blocks is greater than the number of stream processors.

So my other question is: if we’re launching multiple concurrent kernels that make use of atomicCAS and operate on shared data, would the SUM of the number of blocks from all kernels have to be less than the number of processors?


That is generally completely unsafe.

Thanks for the clarification. So, to be clear: if using atomicCAS() on a device int as an atomic lock,
do ALL threads in ALL blocks respect the atomic nature of the call, and will they block while waiting to acquire the lock?
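By “atomic lock” I mean something like the following sketch (0 = free, 1 = held; this is my own code, simplified):

```cuda
// A naive global spinlock built on atomicCAS / atomicExch.
__device__ int lock = 0;

__device__ void acquire()
{
    // spin until we successfully swap 0 -> 1
    while (atomicCAS(&lock, 0, 1) != 0)
        ;
}

__device__ void release()
{
    // hand the lock back
    atomicExch(&lock, 0);
}
```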

Another, related question: if I declare

__shared__ int count;
int X;

within a kernel and then increment it using

atomicAdd(&count, 1);
will it only cause threads WITHIN a block to serialise (with respect to each other) when performing the atomicAdd(), since
the variable being incremented is only visible to members of that block?
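In other words, something like this sketch (names are mine), where I’d expect each block to have its own private copy of count:

```cuda
// Each block counts its own threads in shared memory; the atomicAdd
// should only serialise threads within the same block, because
// __shared__ variables are per-block.
__global__ void blockCount(int *blockTotals)
{
    __shared__ int count;

    if (threadIdx.x == 0)
        count = 0;
    __syncthreads();

    // contends only with threads in THIS block
    atomicAdd(&count, 1);

    __syncthreads();
    if (threadIdx.x == 0)
        blockTotals[blockIdx.x] = count;  // == blockDim.x
}
```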


I think you’re missing my point. Global locking using atomicCAS is unsafe if a thread that fails to acquire the lock cannot complete.
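For concreteness, the classic failure looks something like this (a sketch; it assumes the warp executes in lockstep, which you cannot rule out):

```cuda
// Every thread in a warp tries to take the same lock. One thread wins,
// but on lockstep hardware the warp cannot advance past the loop until
// ALL of its threads leave it, so the winner never reaches the release
// below: the losers spinning prevent the winner from completing.
__global__ void mayHang(int *lock)
{
    while (atomicCAS(lock, 0, 1) != 0)
        ;                       // losers spin here forever...

    atomicExch(lock, 0);        // ...so this line is never reached
}
```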

Wouldn’t that be the fault of the algorithm and simply mean that the locking algorithm is not correct for the GPU architecture?

What if we use a lock under strict assumptions, such as:

- only one thread in a warp can lock / unlock

- no use of __syncthreads(), not even __threadfence()

- locking/unlocking done in the same order, etc.

So my question is can we assume that such usage is safe? If not why?

Is there more info on scheduling on the GPU and its effects on this situation?

The reason I’m interested in this is that I’m already using such a mutex in a Container class managed by the device, and it’s working quite reliably. The lock is fine-grained (on a per-Node basis), contention is low, and it doesn’t perform half-bad either. Knowing more about the issues above would help me understand whether I can extend the algorithms to perform correctly using asynchronous calls.

Thank you

You’re making assumptions about how the scheduler on the SM works; you cannot make any such assumption.

So are you saying that only one thread in a warp should attempt to acquire the lock?