I see, thanks for the explanation! So as I understand it, the first __syncthreads() is used to minimize the time the thread block as a whole spends in the critical section, so that the lock is released as quickly as possible. And the last __syncthreads() is more of a general library-code consideration: since we don't know what happens before and after this function, it may be a good idea to sync again before exiting, just to be safe.
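Just to make sure I have the shape of the pattern right, here is how I picture it as a sketch (the function names acquire_semaphore/release_semaphore are my assumption based on our earlier discussion, not the exact code):

```cuda
// Hypothetical block-wide critical-section pattern: only thread 0 takes the
// lock, the whole block syncs before doing the protected work, and syncs
// again so thread 0 releases only after every thread is done.
__device__ void block_critical_section(volatile int *lock)
{
    if (threadIdx.x == 0)
        acquire_semaphore(lock);   // spin until this block owns the lock
    __syncthreads();               // all threads wait for the lock holder
    // ... critical work done by the whole block ...
    __syncthreads();               // ensure every thread has finished
    if (threadIdx.x == 0)
        release_semaphore(lock);   // hand the lock back
}
```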
I also had a data-race question about *lock = 0 in release_semaphore, where the lock is released by a normal memory write followed by a __threadfence(). Wouldn't that cause a data race between *lock = 0 and the atomicCAS operation in acquire_semaphore? I understand that if two threads issue atomicCAS at the same time, they are serialized (they do not overlap), but what if one thread does an atomicCAS while another does the normal *lock = 0? It seems to me that *lock = 0 should also be an atomic operation, so that it and atomicCAS cannot overlap while operating on the lock variable. Perhaps something like atomicExch() instead of a normal store?
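To make the question concrete, this is the release path I have in mind (a sketch of how I read the original code, with my proposed alternative commented out; the exact shape is my assumption):

```cuda
// Sketch of the release path being discussed. The plain-store version is
// what I understand the original code does; the commented-out atomicExch
// is the alternative I am asking about.
__device__ void release_semaphore(volatile int *lock)
{
    *lock = 0;        // normal (non-atomic) store releasing the lock
    __threadfence();  // make the release visible device-wide

    // Alternative I am suggesting: make the release itself an atomic RMW,
    // so it cannot overlap with a concurrent atomicCAS on the same word:
    // atomicExch((int *)lock, 0);
}
```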
In addition, I was wondering if it's possible to confirm my understanding of why the original post (Try to use lock and unlock in CUDA) can deadlock. You gave a detailed explanation in this Stack Overflow answer (Cuda Mutex, why deadlock? - Stack Overflow), but I was hoping to make sure I am following it correctly. I'll paste the incorrect code here:
__global__ void lockAdd(int *val, int *mutex) {
    while (0 != atomicCAS(mutex, 0, 1)) {} // similar to a spin lock
    (*val)++;             // all threads add one to the value
    atomicExch(mutex, 0); // unlock
}
So as I understand it, the first cause of the deadlock is warp divergence. The way the GPU executes a branch is by executing both the taken path and the fall-through path, selectively disabling the lanes whose condition evaluated to false or true, respectively. But that is not the whole story: the other key point is that there is no guarantee whether the true path or the false path executes first; the hardware is free to pick either. I should add that the compiler could also arrange the two paths either way. If we expand the while loop in the original code into simple control flow, it becomes easier to see both paths:
L0:
    old = atomicCAS(mutex, 0, 1)
    if old != 0:
        go to L0
L1:
    // do critical work
    ...
    atomicExch(mutex, 0) // unlock
Note that the compiler could invert the condition and swap the two paths; that is up to the compiler. When this is executed with warp divergence, the threads can get different values of old, because only one thread actually succeeds in comparing and swapping and sets the mutex to 1. So only that one thread sees old == 0, while every other thread sees old == 1. Now the warp diverges, and the hardware can choose to execute either L0 or L1 first; there is no guarantee which branch runs first. If L0 is executed first, the program deadlocks, since the lock-holding thread never gets a chance to reach the unlock. If L1 is executed first, the unlock is reached before the other lanes loop back to L0, and there is no deadlock, since the lock gets a chance to be released.
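If my understanding is right, I believe this is why the commonly shown restructuring avoids the problem on pre-Volta (lockstep-warp) hardware: the critical section is moved inside the branch that wins the CAS, so the lock holder reaches the unlock no matter which divergent path is scheduled first. A sketch of that version (my restatement, not code from the original thread):

```cuda
// Restructured spin lock: each lane loops until it has personally acquired
// the lock, done its critical work, and released the lock. Whichever path
// the hardware schedules first, the current lock holder can always reach
// atomicExch and release the mutex.
__global__ void lockAdd(int *val, int *mutex) {
    bool done = false;
    while (!done) {
        if (atomicCAS(mutex, 0, 1) == 0) { // this lane acquired the lock
            (*val)++;                      // critical section
            atomicExch(mutex, 0);          // unlock
            done = true;                   // leave the loop
        }
    }
}
```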
Is it possible to confirm whether this understanding is correct? I apologize for the long text, and thank you for taking the time to read it!