Hi, I want to understand how atomic operations are implemented in CUDA. Are they all implemented in hardware, or are some built on atomicCAS? Can the ABA problem be an issue with CUDA atomics? Is there any way to do an atomic fetch-and-add instead of using CAS?
Mostly they are implemented in hardware, judging from the SASS code as well as the documentation.
The ABA problem should not be an issue. From an access perspective, a CUDA atomic is uninterruptible: the read-modify-write operation is carried out with no possibility of another thread intervening in any way.
Atomic fetch-and-add is implemented in CUDA hardware as atomicAdd().
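For illustration, here is a minimal sketch of that (my own example, not from the thread; the kernel and variable names are made up). atomicAdd() returns the value the location held before the add, so each thread can reserve a unique slot directly, with no CAS loop:

#include <cstdio>

// Each thread reserves a unique slot via fetch-and-add on a global counter.
__global__ void reserve_slots(int *counter, int *slots)
{
  int my_slot = atomicAdd(counter, 1);               // returns the old value
  slots[blockIdx.x * blockDim.x + threadIdx.x] = my_slot;
}

int main()
{
  int *counter, *slots;
  cudaMallocManaged(&counter, sizeof(*counter));
  cudaMallocManaged(&slots, 64 * sizeof(*slots));
  *counter = 0;
  reserve_slots<<<2, 32>>>(counter, slots);
  cudaDeviceSynchronize();
  printf("%d\n", *counter);                          // should print 64 (2 blocks x 32 threads)
}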
I wanted to point out a related experience I had.
I was using a volatile write to zero a variable that another thread was incrementing via atomicAdd(). Sometimes the volatile write succeeded, and sometimes it did not.
So even though other threads cannot interfere with the atomic operation itself, that does not mean the atomic operation somehow holds off accesses to the same location by other threads (unless those accesses are themselves atomic…() operations).
I further discovered via benchmarking that using atomic…() is very fast for the calling thread if the return value is not used. So there was little point in my trying to optimize by using volatile anyway.
According to the C++ standard, if one thread modifies a memory location with an atomic operation and another thread accesses the same location with a non-atomic operation (regardless of any volatile qualifier), a data race ensues. It sounds like you need an appropriate memory synchronization mechanism to avoid the data race, e.g. acquire-release semantics via the libcu++ cuda::atomic load()/store() operations or cuda::atomic_thread_fence(), or CUDA’s __threadfence() intrinsic (which corresponds to the strictest, sequentially consistent, memory order).
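As a concrete sketch of that suggestion (mine, not taken from the thread; it assumes a cc 6.0+ GPU and a toolkit that ships libcu++, so something like nvcc -std=c++17 -arch=sm_70 may be needed): the counter is declared as a cuda::atomic, so the reset is an atomic store rather than a plain volatile write.

#include <cstdio>
#include <new>
#include <cuda/atomic>

// Device-scope atomic counter; every access to it goes through an atomic operation.
using counter_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void k(counter_t *d)
{
  d->fetch_add(1, cuda::memory_order_relaxed);       // atomic increment
  if (threadIdx.x == 0)
    d->store(0, cuda::memory_order_release);         // atomic reset: no data race
}

int main()
{
  counter_t *d;
  cudaMallocManaged(&d, sizeof(*d));
  new (d) counter_t(0);                              // construct the atomic in managed memory
  k<<<1, 2>>>(d);
  cudaDeviceSynchronize();
  printf("%d\n", d->load());                         // exact value depends on interleaving,
                                                     // but the accesses are race-free
}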
I’m skeptical.
In my view, the write will always “succeed”, that is, be carried out. It may not have had the effect you were expecting for any number of reasons.
Anyway, no proof has been offered, nor any case to inspect. So my comment here is merely to prevent the proliferation of future readers saying things like “I read on the internet that if you do a write to a location and an atomic, the write may not succeed,” which to me is a dubious and, at this time, unsupported claim.
I’m not sure what you are trying to claim here.
The atomic op most certainly does hold off access to the same location by other threads, whether those threads are doing atomics or anything else, for the duration of the atomic op.
My claim is that an atomic to device memory, from a device thread, will without exception perform a read-modify-write operation such that the read step, the modify step, and the write step are all carried out in such a way that no activity from any other device thread can in any way interfere with, intervene in, interrupt, or otherwise affect those three steps. Once the read step begins, until the write step ends, no other thread can read or write that location.
If you intermix writes and atomics, I know of no reason the writes would not “succeed”. They will be applied outside the boundary of any particular atomic op.
I’m not making C++ statements here. I am describing the behavior that I understand of extant CUDA devices.
Here is a trivial test case:
# cat t150.cu
#include <iostream>
__global__ void k(volatile int *d){
  for (int i = 0; i < 256; i++) atomicAdd((int *)d, 1);
  if (!threadIdx.x) *d = 0;
  for (int i = 0; i < 256; i++) atomicAdd((int *)d, 1);
}
int main(){
  int *d;
  cudaMallocManaged(&d, sizeof(*d));
  *d = 0;
  k<<<1,2>>>(d);
  cudaDeviceSynchronize();
  std::cout << *d << std::endl;
}
# nvcc -o t150 t150.cu
# ./t150
512
#
The only way we get an output of 512 instead of 1024 is if the non-atomic write “succeeded”. I acknowledge this doesn’t prove much, but I would be skeptical of any claim that it might not succeed without an actual test case to inspect.
For amusement purposes, here is another test case:
# cat t151.cu
#include <iostream>
__global__ void k(volatile int *d){
  if (!blockIdx.x) {
    // block 0: 1024 atomic increments
    for (int i = 0; i < 1024; i++) atomicAdd((int *)d, 1);
  } else {
    // block 1: spin until the counter is nonzero, then zero it
    // with an ordinary (non-atomic) volatile write
    while (*d == 0) {};
    *d = 0;
  }
}
int main(){
  int *d;
  cudaMallocManaged(&d, sizeof(*d));
  *d = 0;
  k<<<2,1>>>(d);
  cudaDeviceSynchronize();
  std::cout << *d << std::endl;
}
# nvcc -o t151 t151.cu
# ./t151
831
#
Fair enough.
I cannot find the code that led to my original observation nor any comments related to it.
FYI, the code likely involved two concurrent kernels accessing the variable.
I have tried to reproduce the observation in a standalone program, which uses two concurrent kernels, but I have failed. Everything works as you say.
So for now, “Mark it down as another UFO”, as they said in a Star Trek episode, and please accept my apologies.
I recreated my UFO observation, but I could only do it when using shared memory, which was the last thing I tried, of course.
Using global memory for the victim variable worked as you said it should work in every case I tried: single blocks, multiple blocks, multiple concurrent kernels, etc.
I still don’t understand why the code in the post below doesn’t work with a volatile write.
It works fine if I substitute atomicExch() for the volatile write.
Given King_Crimson’s reply, I don’t think this is a bug as far as C++ goes, but I still find it an interesting phenomenon.
See:
I’ve responded in that thread. Your observation is unreproducible for me, contains uncompilable code (per godbolt), and has strange-looking atomic construction (to me).
I would like to amend my comments above as applying to global atomics. For shared atomics prior to the Maxwell architecture (compute capability less than 5.0, not supported by any current CUDA toolkit), I’m fairly confident it is unsafe to mix ordinary shared accesses with atomic shared accesses to the same location. By unsafe I mean unexpected or unpredictable results.
Maxwell shared atomics removed the hazard I am familiar with/have in mind for pre-cc5.0 devices.
My testing on a later architecture (cc8.9) doesn’t show any issues mixing ordinary shared accesses with shared atomics, but there is a report that things are not working on cc 6.1. I have not tested all architectures.
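For concreteness, here is a minimal sketch of the kind of mixing under discussion (a hypothetical test patterned after t150.cu above, not the code anyone in this thread actually ran): shared-memory atomicAdd() interleaved with an ordinary shared write.

#include <iostream>

__global__ void k(int *out)
{
  __shared__ int s;
  if (!threadIdx.x) s = 0;
  __syncthreads();
  for (int i = 0; i < 256; i++) atomicAdd(&s, 1);    // shared-memory atomics
  if (!threadIdx.x) s = 0;                           // ordinary (non-atomic) shared write
  for (int i = 0; i < 256; i++) atomicAdd(&s, 1);
  __syncthreads();
  if (!threadIdx.x) *out = s;                        // publish the result
}

int main()
{
  int *out;
  cudaMallocManaged(&out, sizeof(*out));
  k<<<1, 2>>>(out);
  cudaDeviceSynchronize();
  std::cout << *out << std::endl;                    // per the discussion above, whether this
                                                     // behaves like t150.cu may depend on the
                                                     // GPU architecture
}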