Weird behavior of atomic operations on Ampere architecture GPUs

I tried the following code on two machines, one with GTX 1060 and another one with RTX 3070:

#include <iostream>
#include <cuda_runtime.h>

constexpr int num_blocks = 1024;
constexpr int num_threads = 32;

struct Lock {
  int *locked;

  Lock() {
    int init = 0;
    cudaMalloc(&locked, sizeof(int));
    cudaMemcpy(locked, &init, sizeof(int), cudaMemcpyHostToDevice);
  }

  ~Lock() {
    cudaFree(locked);
  }

  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }

  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

__global__ void counter(Lock lock, int *total) {
  if (threadIdx.x == 1) {
    lock.acquire_lock();
    *total = *total + 1;
    lock.unlock();
  }
}

int main() {
  int *total_dev;
  cudaMalloc(&total_dev, sizeof(int));
  int total_host = 0;
  cudaMemcpy(total_dev, &total_host, sizeof(int), cudaMemcpyHostToDevice);
  Lock lock;
  counter<<<num_blocks, num_threads>>>(lock, total_dev);
  cudaDeviceSynchronize();
  cudaMemcpy(&total_host, total_dev, sizeof(int), cudaMemcpyDeviceToHost);
  std::cout << total_host << std::endl;
  cudaFree(total_dev); 
}

The program basically counts the number of blocks, with the help of a lock built from the atomicCAS and atomicExch instructions.

On my GTX 1060 machine, the program successfully outputs the expected result 1024; however, it doesn’t produce a stable result on my RTX 3070 machine (it varies from 40 to 100 across runs). I use CUDA 11.3 on both machines.

When sharing global state amongst threadblocks, the lock/unlock you have is insufficient to ensure correctness. Memory access ordering also matters, taking caches into account. I think it’s likely you could fix this by adding the volatile keyword to int *total:

__global__ void counter(Lock lock, volatile int *total) {

That does “seem” to fix it for me, but after thinking about this a bit more, I believe you might be better advised to put a threadfence instead:

__global__ void counter(Lock lock, int *total) {
  if (threadIdx.x == 1) {
    lock.acquire_lock();
    *total = *total + 1;
    __threadfence();
    lock.unlock();
  }
}

On the subject of design flaws, your code will throw an error if run with cuda-memcheck.

The reason for this is as follows.

  • You are passing lock by-value to the kernel.
  • When you pass a class/struct by value to a function in C++, a local copy of that class/struct is made for use by the function. This copy is made with a user-supplied copy constructor, or with the default copy constructor if none is provided (assuming the class is trivially copyable, which this one is).
  • At the completion of the function call (the kernel, in this case) the destructor of the object-copy that was made for use by the function is called.
  • Now, in your case, the trivially-made object-copy of lock holds the same value of locked as the original. Indeed, this is necessary for your design.
  • At the completion of the kernel, the object-copy has its destructor called, and in this case the user-provided destructor deallocates locked, which is numerically the same pointer in both lock and its object-copy.
  • As a result, from that point forward, any further use of that pointer is undefined behavior (UB).
  • At the completion of your program, the original lock object goes out of scope, and therefore calls its destructor. This results in a call to cudaFree on the same numerical value already freed. As a result, an error is thrown.

I think overall this is not a robust design. It’s difficult to fix this issue by modifying the destructor alone. One possible approach is to leave the destructor empty, and provide a deallocate method that you call manually.


@Robert_Crovella, thanks very much for your reply! Yes, adding a threadfence solves the problem.

I have some further questions:

  1. What does __threadfence do in this program? Does it write *total out to global memory and notify the other active blocks?
  2. For the memory issue, I tried to make locked a CPU variable and let CUDA copy it to the GPU implicitly:
struct Lock {
  int locked[1];

  Lock() {
    *locked = 0; 
  }

  ~Lock() {}

  __device__ __forceinline__ void acquire_lock() {
    while (atomicCAS(locked, 0, 1) != 0);
  }

  __device__ __forceinline__ void unlock() {
    atomicExch(locked, 0);
  }
};

However, the output becomes 0.

Please read the documentation. If you want to know in detail what happens at the device code level, I suggest you learn to use the CUDA binary utilities.

Did you run that code with cuda-memcheck or compute-sanitizer ?

There is more than one problem with this approach. cuda-memcheck results will cast doubt on this methodology. Another way to indict this approach is to think carefully about it from a C++ perspective. Suppose you passed an object of that struct to two different functions (or two different calls to the same function). How would you expect it to behave? When they do an atomic update to a variable in the object, would both function calls be updating the same variable? (hint: they would not). This is again due to pass-by-value characteristics. So your approach is not logical from a C++ perspective either.

Yes both tools tell me I encounter the following errors:

Program hit unspecified launch failure (error 719) on CUDA API call to cudaMemcpy.
Program hit unspecified launch failure (error 719) on CUDA API call to cudaFree.

Correct. So the methodology is broken from a CUDA coding perspective. In addition it is not a logical solution from a C++ perspective. Please read the update that I edited into my previous response.

Without trying to dive deeply into what exactly is going on that causes the CUDA runtime error, we can indict this approach from a CUDA coding perspective as follows:

  • atomics in CUDA are provided which can operate on global entities or shared entities. No atomics are provided that operate on local space entities. (This is not an oversight; local space entities are only accessible from a single thread. Therefore atomics would be of questionable value.)
  • items passed to a function via function arguments are in the local space of that function (this is connected to pass-by-value semantics, also).

Therefore, it’s not sensible to conclude that CUDA can do atomics in the fashion you are now proposing. The tools seem to confirm this.


Yes, I agree with you: if passed by value, then locked should behave like a thread-local register variable, and thus not be usable for locking at all.

Using the same reasoning, we could imagine that CUDA atomics won’t work in this scenario either. I’ve updated my previous post with some more verbiage around this. This might be an explanation for the tool result.


This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.