Understanding a spinlock implementation by Robert Crovella


From this post (Try to use lock and unlock in CUDA), I was able to locate a stackoverflow answer (Cuda atomics change flag - Stack Overflow) by Robert Crovella that provides an example implementation of a spinlock using atomic compare and swap. The code is as follows:

__syncthreads();
if (threadIdx.x == 0)
  acquire_semaphore(&sem);
__syncthreads();
//begin critical section
// ... your critical section code goes here
//end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
  release_semaphore(&sem);
__syncthreads();

where the helper functions are:

__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
  while (atomicCAS((int *)lock, 0, 1) != 0);
}

__device__ void release_semaphore(volatile int *lock){
  *lock = 0;
}

This code is already very helpful, but I was hoping to make sure I really understand it correctly. First of all, in the first code block there are 4 __syncthreads() calls and 1 __threadfence() call. The __threadfence() after the critical section makes sense: since CUDA has a relaxed memory model, we need it to make the writes in the critical section visible to threads in other blocks. However, of the 4 __syncthreads(), I think I understand only two of them, and my thoughts are as follows:

  • we need the second __syncthreads() because our master thread (threadIdx.x = 0) may not have successfully acquired the lock yet, and we don’t want other warps to proceed into the critical section before threadIdx.x = 0 has acquired the lock
  • we need the third __syncthreads() because we want to make sure all warps have reached this point (finished the critical work) before the master thread attempts to release the lock. If some warps have not yet finished when the lock is released, mutual exclusion is violated

Basically I do see the need for the second and the third __syncthreads() to ensure mutual exclusion, but I don’t quite understand why we need the first and the last __syncthreads(). Thanks!

For example, without the first __syncthreads(), the master thread could have already acquired the lock while other warps have not yet reached the if, but this seems fine to me, because it doesn’t violate mutual exclusion.

They are probably not necessary. The mechanism here is contending for a global lock. That is, it is expected to be competing with other threadblocks to acquire the lock.

Access to the critical section (i.e. having acquired the lock) implies that other threadblocks that may be wanting to access the critical section will be waiting.

Coupled with that, the expectation here is that you may wish to have not just a single thread but the entire threadblock work on the critical section, as you point out related to the second and third usage of __syncthreads().

So the thought process is that if we allow thread 0 in the block to acquire the lock when other threads in the block are not actually ready to proceed into the critical section, then we are holding onto the lock longer than is necessary. Given that the lock may be contended among multiple threadblocks, you may be holding up a large number of threadblocks this way. Stated another way, the first syncthreads confirms that the time spent in the second syncthreads will be as short as possible.

The first syncthreads confirms that all threads are ready to process the critical section, before thread 0 is allowed to compete for the lock.

The last one is probably not necessary.

Regarding the first and last syncthreads, when you are writing a general purpose piece of code for such an activity, with no stipulations about what the code inside and outside the critical section will be doing, it’s hard to think of every conceivable case.

I see, thanks for the explanation! So as I understand it, the first syncthreads is used to minimize the time the threadblock as a whole spends in the critical section, so the lock is released as quickly as possible. And the last syncthreads is more of a general library-code consideration: since we don’t know what happens before and after this section, it may be a good idea to sync again before exiting, just to be safe.

I also had a data race question about *lock = 0 in release_semaphore, where the lock is released by a normal memory write, followed by a threadfence. Wouldn’t that cause a data race between *lock = 0 and the atomicCAS operation in acquire_semaphore? I understand that if two threads try to do atomicCAS at the same time, they will be serialized (and not overlap), but what if one thread does atomicCAS and another one does a normal *lock = 0? It seems to me that *lock = 0 should also use an atomic operation, to ensure that it and atomicCAS do not overlap when operating on the lock variable. Perhaps something like atomicExch(), instead of a normal store?
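To illustrate what I mean in std::atomic terms, here is a host-side C++ sketch of the two release styles (my own illustration, not the CUDA code itself; function names are invented). Note that in standard C++ both variants are atomic once the variable is std::atomic, which is part of why I’m asking whether the plain store on a non-atomic volatile int is a data race:

```cpp
#include <atomic>

// Host-side sketch: the CAS acquire loop mirrors acquire_semaphore,
// and the two release variants mirror `*lock = 0` versus the
// atomicExch()-style release I am asking about.

void spin_lock_acquire(std::atomic<int> &lock) {
  int expected = 0;
  // Retry until we swap 0 -> 1, like atomicCAS((int *)lock, 0, 1) != 0
  while (!lock.compare_exchange_weak(expected, 1,
                                     std::memory_order_acquire))
    expected = 0; // compare_exchange rewrites expected on failure
}

// Plain-store release (analogue of the CUDA code's `*lock = 0`):
void release_plain(std::atomic<int> &lock) {
  lock.store(0, std::memory_order_release);
}

// Exchange-based release (the atomicExch() variant I am suggesting):
void release_exchange(std::atomic<int> &lock) {
  lock.exchange(0, std::memory_order_release);
}
```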

In addition, I was wondering if it’s possible to confirm my understanding of why the original post (Try to use lock and unlock in CUDA) can have a deadlock. You had a detailed explanation in this stackoverflow answer (Cuda Mutex, why deadlock? - Stack Overflow), but I was hoping to make sure I am following it correctly. I’ll paste the incorrect code here:

__global__ void lockAdd(int *val, int *mutex) {
  while (0 != (atomicCAS(mutex, 0, 1))) {} // similar to spin lock
  (*val)++; // all threads add one to the value
  atomicExch(mutex, 0); // unlock
}

So as I understand, the deadlock is first due to warp divergence. GPUs execute a branch by executing both the if section and the fall-through section, selectively disabling the lanes where the condition evaluates to False and True, respectively. But this is not the whole reason: another key is that there is no guarantee whether the True path or the False path will be executed first; the HW is free to do either. I should add that the compiler could arrange the two paths either way as well. If we expand the while loop in the original code into simple control flow, it becomes easier to see both paths:

L0: old = atomicCAS(mutex, 0, 1)
    if old != 0:
      go to L0
L1: // do critical work
    atomicExch(mutex, 0) // unlock

Note that the compiler could reverse the if condition and swap the two paths; this is up to the compiler. When this is executed by the HW with warp divergence, the threads can get different values of old, because only one thread will actually successfully compare-and-swap and set the mutex to 1; only that thread will see old == 0, and all other threads’ old will be 1. Now warp divergence begins, and the HW can choose to go first to either L0 or L1; there’s no guarantee which branch executes first. As a result, when L0 is executed first, the program deadlocks, since the lock-holding thread never gets a chance to reach the unlock. And when L1 is executed first, the unlock can be reached before going back to L0, so there is no deadlock, since the lock gets a chance to be released.
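To convince myself, I sketched a toy host-side model of this in plain C++ (all names and the scheduling policy are invented; it only mimics the scheduler’s lockstep branch choice, not real warp execution). If the scheduler always runs the spinning side of the divergent branch first, the winning lane never reaches the unlock:

```cpp
#include <vector>

// Toy lockstep model of one warp running the lockAdd loop.
// Lane states: 0 = spinning at L0, 1 = past the loop (at L1), 2 = done.
bool run_warp(bool retry_side_first, int lanes, int max_steps) {
  int mutex = 0;
  int val = 0;
  std::vector<int> state(lanes, 0);

  // Initial atomicCAS: lanes execute in lockstep; exactly one wins.
  for (int i = 0; i < lanes; ++i) {
    int old = mutex;
    if (old == 0) mutex = 1;
    state[i] = (old == 0) ? 1 : 0;
  }

  for (int step = 0; step < max_steps; ++step) {
    bool any_spinning = false, any_at_l1 = false;
    for (int s : state) {
      if (s == 0) any_spinning = true;
      if (s == 1) any_at_l1 = true;
    }
    if (!any_spinning && !any_at_l1) return true; // all lanes finished

    // The "scheduler" picks which side of the divergent branch runs.
    bool run_retry = retry_side_first ? any_spinning : !any_at_l1;
    if (run_retry) {
      // Spinning lanes retry atomicCAS(mutex, 0, 1) -- and keep failing
      // while the winner still holds the lock.
      for (int i = 0; i < lanes; ++i)
        if (state[i] == 0) {
          int old = mutex;
          if (old == 0) { mutex = 1; state[i] = 1; }
        }
    } else {
      // Lanes past the loop do the critical work and unlock.
      for (int i = 0; i < lanes; ++i)
        if (state[i] == 1) { ++val; mutex = 0; state[i] = 2; }
    }
  }
  return false; // still stuck after max_steps: the modeled deadlock
}
```

With `retry_side_first = true` the model never makes progress (my understanding of the deadlock case); with `retry_side_first = false` every lane eventually acquires, increments, and releases.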

Is it possible to confirm this understanding is correct? I apologize for the long text, and thank you a lot for taking the time to read it!

I don’t understand the concern. If you believe there is a race condition (of concern), then that would suggest to me that you believe there is an important difference in outcome.

The lock is acquired in a loop. If the initial try is unsuccessful, it is repeated.

So we have one thread that is repeatedly trying to acquire the lock (looking for an atomic with a returned value of 0). The other thread is writing zero to the atomic location.

There are only two possible outcomes:

  1. the atomic returns a value of 0. This could only happen if, from the point of view of the atomic, a previous write of 0 was successful.
  2. the atomic returns a non-zero value. This could only happen if, from the point of view of the atomic, a previous proper write of 0 had not yet taken place. In this case the atomic will be retried. Eventually we exit the retry loop because condition 1 is met.

I don’t see a hazard there. The atomic operation cannot “wipe out” or permanently corrupt the write of 0. Eventually the zero value will be visible (promoted by the __threadfence()) and the atomic will “succeed”.

Even if we posit the possibility of data tearing (which I do not believe is possible) the atomic method should still work. A torn read or write simply means that in the passing case, the atomic got a zero, which implies success and could not happen any other way (given the choices of 0 and 1 - you cannot tear a bit) and in the failing case, we do not care, it will be retried.

Yes, many of the questionable mutex mechanisms (IMO) attempt to negotiate for locks intra-warp, which is, or was (pre-Volta), particularly hazardous. The mechanism you started out asking about was created specifically to avoid that, amongst other objectives.

I wouldn’t argue about most of the rest of what you have posted. However, I think it’s important (because others will read this post) to note that Volta changed things considerably, and therefore the most hazardous cases apply to hardware that is now more than 5 years old (Volta was introduced in 2017). In my opinion, a key point of understanding is the one you have bolded: the hardware decision-making in the presence of warp divergence is what underpins all of this. The example that started your post attempted to avoid any dependence on HW decision-making, even pre-Volta.

Regarding the race condition, thank you for explaining what could happen and the two possible outcomes. I guess my real question is about the semantics of the code, i.e. is there a data race, and does it lead to undefined behavior? I was thinking from a C++ point of view, because in C++ this code technically has a data race: the store (unlock) is not explicitly atomic, so it can overlap with an atomicCAS from another thread. And in C++, a data race is undefined behavior, which technically makes the program impossible to reason about.

But I realized that this is CUDA C++ code (tightly coupled to NVIDIA GPUs) rather than standard C++ (which doesn’t assume any underlying architecture), and there are at least two differences:

  1. In CUDA C++ (specific to NVIDIA hardware), I think it’s possible that loads and stores to aligned global memory addresses are atomic, as on x86. So a normal aligned store could itself be atomic, in which case there is no longer a data race.
  2. Even if the store is not atomic (which implies a data race), it is not strict undefined behavior as in C++. My understanding is that CUDA C++ (and NVIDIA hardware) provides stronger guarantees for such data races, so even if there is technically a data race, we can still reason about the program. And as you mentioned, even if the store could be torn (the data-tearing case), the code should still work properly, for the reasons you have shown above.

FYI, the C++ logic I was thinking about will look like this:

#include <atomic>

struct cas_lock {
  std::atomic<bool> lock_ = {false};

  void lock() { while (lock_.exchange(true, std::memory_order_acquire)); }

  void unlock() { lock_.store(false, std::memory_order_release); }
};

where the lock variable would be explicitly declared atomic, to make the program semantic portable.
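For completeness, here is a small host-side demo of how I would use this cas_lock to protect a shared counter across threads (my own sketch; the helper name is made up):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Same cas_lock as above: CAS-style acquire, plain atomic-store release.
struct cas_lock {
  std::atomic<bool> lock_ = {false};
  void lock() { while (lock_.exchange(true, std::memory_order_acquire)); }
  void unlock() { lock_.store(false, std::memory_order_release); }
};

// Each of nthreads threads increments a deliberately non-atomic counter
// iters times under the lock; the result is deterministic.
int locked_count(int nthreads, int iters) {
  cas_lock m;
  int counter = 0; // the lock, not the int, provides the atomicity
  std::vector<std::thread> ts;
  for (int t = 0; t < nthreads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < iters; ++i) {
        m.lock();
        ++counter; // critical section
        m.unlock();
      }
    });
  for (auto &th : ts) th.join();
  return counter;
}
```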

And in terms of memory ordering, I think that even if a normal store in CUDA doesn’t imply any fence, the code should still work, thanks to the explicit __threadfence() after the critical section. It functions like std::memory_order_release here, ensuring the writes in the critical section are ordered before the store that releases the lock. And when locking, I think the atomicCAS call already implies enough ordering to function like std::memory_order_acquire.
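The fence-then-plain-store pattern has a direct analogue in standard C++: a standalone release fence followed by a relaxed store (again my own host-side sketch, with invented names; standard C++ treats a release fence followed by an atomic store as a valid release sequence):

```cpp
#include <atomic>

// C++ analogue of __threadfence() followed by the plain `*lock = 0`:
// a standalone release fence, then a relaxed store of 0.

std::atomic<int> lock_word{0};

void fence_style_acquire() {
  int expected = 0;
  // CAS loop, like acquire_semaphore; acquire ordering on success.
  while (!lock_word.compare_exchange_weak(expected, 1,
                                          std::memory_order_acquire))
    expected = 0;
}

void fence_style_release() {
  // Orders all prior writes before the store below, playing the role
  // of __threadfence() in the CUDA code.
  std::atomic_thread_fence(std::memory_order_release);
  lock_word.store(0, std::memory_order_relaxed);
}
```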

Feel free to use an atomic if you are worried about the C++ issue you mention. The subject of simultaneous access to memory (unconstrained by atomics) in CUDA is a complex one. There is a PTX memory model that is required for a formal treatment/understanding of it. But in CUDA C++, to my knowledge, simultaneous access to the same location by separate threads (unconstrained by atomics) is not undefined behavior in the most general sense of the words - “anything at all could happen, including machine meltdown, appearance of data out of thin air, or a quantum fire burning up the universe”. Yes, there are certain aspects of it that are undefined (because the ordering is undefined if you don’t control it), but in my experience and understanding, it is not true that “any outcome is possible”. The possible outcomes still belong to a finite set.

Anyhow, this is a complex topic, and I won’t be able to sort it all out, and generally wouldn’t try. Use atomics for the store/release if you like. It’s a fine solution. Rather than using atomics for this type of activity, I think the “modern” method would be to use libcu++, such as a binary semaphore. This thread is discussing something I wrote years ago, when libcu++ did not exist.

I see, thank you very much for confirming that CUDA C++ does guarantee something stronger for this situation, and it won’t result in time travel or anything, as in the most general sense of undefined behavior.

And thank you for the additional links to the “modern” methods!
