Std::cuda::atomic::load() deadlock

Hello,
I upgraded from CUDA 11.0 to CUDA 12.4, and a kernel that uses atomics in CUDA 12 deadlocks.

The code works as follows: in every block there are 2 active threads, a producer and a consumer:

  • the producer thread increments an atomic variable ‘n_blocks’ when produces a block of data,
  • the consumer thread decrements the atomic variable when acquires the block of data.

The kernel is launched with 1024 threads that are needed before the problematic piece of code.

// 1024 threads are active
__shared__ ::cuda::atomic<std::uint32_t, ::cuda::thread_scope_block> n_blocks;
__shared__ ::cuda::atomic<bool, ::cuda::thread_scope_block>          done;

if (threadIdx.x == 0)
{
   n_blocks.store(0);
   done.store(false);
}
__syncthreads();

if (threadIdx.x == 0)
{  // producer thread
   while(!done.load())
   {
      while(n_blocks.load() == max_number_of_blocks)
      {
          // do nothing, all blocks have been produced, wait for an empty slot
      }
      // produce a block and increment 'n_blocks'
      n_blocks.fetch_add(1);
   }
}

if (threadIdx.x == 1)
{  // consumer
   for(...) // all the data to be processed
   {
      while(n_blocks.load() == 0)
      {
          // do nothing, no available blocks, wait for a new block
      }
      // acquire the available block, use it, mark it as "used" decrementing the counter
      n_blocks.fetch_sub(1);       
   }
   // when done, update the atomic 'done' in order to stop thread 0
   done.store(true);
}
// kernel exit

In CUDA 11 this works fine, in CUDA 12 it stucks in the while loop of the producer, specifically in the method atomic::load(). With the debugger, I collect the following call stack:

[CUDA]main.cu!__cuda_fence_sc_block Line 14 [0x0000001201378ac0]	
[CUDA]main.cu!__atomic_load_cuda Line 47 [0x0000001201378ac0]	
[CUDA]main.cu!__atomic_load_n_cuda Line 120 [0x0000001201378ac0]	
[CUDA]main.cu!__cxx_atomic_load Line 207 [0x0000001201378ac0]	
[CUDA]main.cu!__cxx_atomic_load Line 394 [0x0000001201378ac0]	
[CUDA]main.cu!memory_order) const Line 1343 [0x0000001201378ac0]

The last function is implemented as follows:

static inline _LIBCUDACXX_DEVICE void __cuda_fence_sc_block() { asm volatile("fence.sc.cta;":::"memory"); }

Can someone help me to understand what is going on? Is this linked to the fact that only two threads end up in a synchronization barrier and the implementation waits for all the 1024 to arrive there? But then, the atomic::load() function can only be called by every thread in a block?

I suggest providing a short, complete example.