I see, thanks for the explanation! So as I understand it, the first __syncthreads() is used to minimize the time the thread block as a whole spends in the critical section, so that the lock is released as quickly as possible. And the last __syncthreads() is more of a general library-code consideration: since we don't know what happens before and after this function, it may be a good idea to sync again before exiting, just to be safe.
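Just to make sure I have the shape of the pattern right, here is how I picture it as a sketch (the function names acquire_semaphore/release_semaphore are my assumption based on our earlier discussion, not the exact code):

```cuda
// Hypothetical block-wide critical-section pattern: only thread 0 takes the
// lock, the whole block syncs before doing the protected work, and syncs
// again so thread 0 releases only after every thread is done.
__device__ void block_critical_section(volatile int *lock)
{
    if (threadIdx.x == 0)
        acquire_semaphore(lock);   // spin until this block owns the lock
    __syncthreads();               // all threads wait for the lock holder
    // ... critical work done by the whole block ...
    __syncthreads();               // ensure every thread has finished
    if (threadIdx.x == 0)
        release_semaphore(lock);   // hand the lock back
}
```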
I also had a data-race question about *lock = 0 in release_semaphore, where the lock is released by a normal memory write followed by a __threadfence(). Wouldn't that cause a data race between *lock = 0 and the atomicCAS operation in acquire_semaphore? I understand that if two threads issue atomicCAS at the same time, they are serialized (they do not overlap), but what if one thread does an atomicCAS while another does the normal *lock = 0? It seems to me that *lock = 0 should also be an atomic operation, so that it and atomicCAS cannot overlap while operating on the lock variable. Perhaps something like atomicExch() instead of a normal store?
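To make the question concrete, this is the release path I have in mind (a sketch of how I read the original code, with my proposed alternative commented out; the exact shape is my assumption):

```cuda
// Sketch of the release path being discussed. The plain-store version is
// what I understand the original code does; the commented-out atomicExch
// is the alternative I am asking about.
__device__ void release_semaphore(volatile int *lock)
{
    *lock = 0;        // normal (non-atomic) store releasing the lock
    __threadfence();  // make the release visible device-wide

    // Alternative I am suggesting: make the release itself an atomic RMW,
    // so it cannot overlap with a concurrent atomicCAS on the same word:
    // atomicExch((int *)lock, 0);
}
```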
In addition, I was wondering if it's possible to confirm my understanding of why the original post (Try to use lock and unlock in CUDA) can deadlock. You gave a detailed explanation in this Stack Overflow answer (Cuda Mutex, why deadlock? - Stack Overflow), but I was hoping to make sure I am following it correctly. I'll paste the incorrect code here:
__global__ void lockAdd(int *val, int *mutex) {
    while (0 != atomicCAS(mutex, 0, 1)) {} // similar to a spin lock
    (*val)++;             // all threads add one to the value
    atomicExch(mutex, 0); // unlock
}
So as I understand it, the first cause of the deadlock is warp divergence. The way the GPU executes a branch is by executing both the taken path and the fall-through path, selectively disabling the lanes whose condition evaluated to false or true, respectively. But that is not the whole story: the other key point is that there is no guarantee whether the true path or the false path executes first; the hardware is free to pick either. I should add that the compiler could also arrange the two paths either way. If we expand the while loop in the original code into simple control flow, it becomes easier to see both paths:
L0:
    old = atomicCAS(mutex, 0, 1)
    if old != 0:
        go to L0
L1:
    // do critical work
    ...
    atomicExch(mutex, 0) // unlock
Note that the compiler could invert the condition and swap the two paths; that is up to the compiler. When this is executed with warp divergence, the threads can get different values of old, because only one thread actually succeeds in comparing and swapping and sets the mutex to 1. So only that one thread sees old == 0, while every other thread sees old == 1. Now the warp diverges, and the hardware can choose to execute either L0 or L1 first; there is no guarantee which branch runs first. If L0 is executed first, the program deadlocks, since the lock-holding thread never gets a chance to reach the unlock. If L1 is executed first, the unlock is reached before the other lanes loop back to L0, and there is no deadlock, since the lock gets a chance to be released.
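If my understanding is right, I believe this is why the commonly shown restructuring avoids the problem on pre-Volta (lockstep-warp) hardware: the critical section is moved inside the branch that wins the CAS, so the lock holder reaches the unlock no matter which divergent path is scheduled first. A sketch of that version (my restatement, not code from the original thread):

```cuda
// Restructured spin lock: each lane loops until it has personally acquired
// the lock, done its critical work, and released the lock. Whichever path
// the hardware schedules first, the current lock holder can always reach
// atomicExch and release the mutex.
__global__ void lockAdd(int *val, int *mutex) {
    bool done = false;
    while (!done) {
        if (atomicCAS(mutex, 0, 1) == 0) { // this lane acquired the lock
            (*val)++;                      // critical section
            atomicExch(mutex, 0);          // unlock
            done = true;                   // leave the loop
        }
    }
}
```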
Is it possible to confirm whether this understanding is correct? I apologize for the long text, and thank you for taking the time to read it!