Weird behavior of atomic operations on Ampere architecture GPUs

I tried the following code on two machines, one with GTX 1060 and another one with RTX 3070:

#include <iostream>
#include <cuda_runtime.h>

constexpr int num_blocks = 1024;
constexpr int num_threads = 32;

struct Lock {
  int *locked;

  Lock() {
    int init = 0;
    cudaMalloc(&locked, sizeof(int));
    cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
  }

  ~Lock() {
    cudaFree(locked);
  }

  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }

  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

__global__ void counter(Lock lock, int *total) {
  if (threadIdx.x == 1) {
    lock.acquire_lock();
    *total = *total + 1;
    lock.unlock();
  }
}

int main() {
  int *total_dev;
  cudaMalloc(&total_dev, sizeof(int));
  int total_host = 0;
  cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
  Lock lock;
  counter<<<num_blocks, num_threads>>>(lock, total_dev);
  cudaDeviceSynchronize();
  cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << total_host << std::endl;
  cudaFree(total_dev); 
}

The program basically counts the number of blocks, with the help of a lock built from the atomicCAS and atomicExch instructions.

On my GTX 1060 machine, the program successfully outputs the expected result 1024; however, it doesn’t produce a stable result on my RTX 3070 machine (it varies from 40 to 100 across runs). I use CUDA 11.3 on both machines.

When sharing global state amongst threadblocks, the lock/unlock you have is insufficient to ensure correctness. Memory access ordering also matters, taking caches into account. I think it’s likely you could fix this by adding the volatile keyword to int *total:

__global__ void counter(Lock lock, volatile int *total) {

That does “seem” to fix it for me, but after thinking about this a bit more, I believe you might be better advised to put a threadfence instead:

__global__ void counter(Lock lock, int *total) {
  if (threadIdx.x == 1) {
    lock.acquire_lock();
    *total = *total + 1;
    __threadfence();
    lock.unlock();
  }
}

On the subject of design flaws, your code will throw an error if run with cuda-memcheck.

The reason for this is as follows.

  • You are passing lock by-value to the kernel.
  • When you pass a class/struct by value to a function in C++, a local copy of that class/struct is made for use by the function. This copy is made with a user-supplied copy constructor, or with the default copy constructor if none is provided (assuming the class is trivially copyable, which this one is).
  • At the completion of the function call (the kernel, in this case) the destructor of the object-copy that was made for use by the function is called.
  • Now, in your case, the trivially-made object-copy of lock holds the same value of locked as the original. Indeed, this is necessary for your design.
  • At the completion of the kernel, the object-copy has its destructor called, and in this case the user-provided destructor deallocates locked, which is numerically the same pointer in both lock and its object-copy.
  • As a result, from that point forward, any further use of that pointer is undefined behavior (UB).
  • At the completion of your program, the original lock object goes out of scope, and therefore calls its destructor. This results in a call to cudaFree on the same numerical value already freed. As a result, an error is thrown.

I think overall this is not a robust design. It’s difficult to fix this issue by modifying the destructor alone. One possible approach is to leave the destructor empty, and provide a deallocate method that you call manually.


@Robert_Crovella, thanks very much for your reply! Yes, adding a threadfence solves the problem.

I have some further questions:

  1. What does __threadfence do in this program? Does it write *total out to global memory and notify the other active blocks?
  2. For the memory issue, I tried to make locked a CPU variable and let CUDA copy it to the GPU implicitly:
struct Lock {
  int locked[1];

  Lock() {
    *locked = 0; 
  }

  ~Lock() {}

  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }

  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

However, the output becomes 0.

Please read the documentation. If you want to know in detail what happens at the device code level, I suggest you learn to use the CUDA binary utilities.

Did you run that code with cuda-memcheck or compute-sanitizer ?

There is more than one problem with this approach. cuda-memcheck results will cast doubt on this methodology. Another way to indict this approach is to think carefully about it from a C++ perspective. Suppose you passed an object of that struct to two different functions (or two different calls to the same function). How would you expect it to behave? When they do an atomic update to a variable in the object, would both function calls be updating the same variable? (hint: they would not). This is again due to pass-by-value characteristics. So your approach is not logical from a C++ perspective either.

Yes both tools tell me I encounter the following errors:

Program hit unspecified launch failure (error 719) on CUDA API call to cudaMemcpy.
Program hit unspecified launch failure (error 719) on CUDA API call to cudaFree.

Correct. So the methodology is broken from a CUDA coding perspective. In addition it is not a logical solution from a C++ perspective. Please read the update that I edited into my previous response.

Without trying to dive deeply into what exactly is going on that causes the CUDA runtime error, we can indict this approach from a CUDA coding perspective as follows:

  • atomics in CUDA are provided which can operate on global entities or shared entities. No atomics are provided that operate on local space entities. (This is not an oversight; local space entities are only accessible from a single thread. Therefore atomics would be of questionable value.)
  • items passed to a function via function arguments are in the local space of that function (this is connected to pass-by-value semantics, also).

Therefore, it’s not sensible to conclude that CUDA can do atomics in the fashion you are now proposing. The tools seem to confirm this.


Yes, I agree with you: if passed by value, then locked should behave like a thread-local register variable, and thus not be usable for locking at all.

Using the same reasoning, we could imagine that CUDA atomics won’t work in this scenario either. I’ve updated my previous post with some more verbiage around this. This might be an explanation for the tool result.


This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.