Deadlock in busy-waiting queue

I have a minimal (non-)working example of a deadlock that I can produce with 2 blocks of 32 threads each.
It comes from a bigger problem where I use busy waiting to feed each block a piece of work until the end of the queue is reached. I know this could be done more cleverly for this reduced example; it is only here to show the deadlock behaviour.
Whenever I add any other printf statement to my kernel, or when I set a breakpoint, the deadlock disappears; otherwise it occurs every time on my hardware:
NVIDIA RTX A500 Laptop GPU, driver 535.171.04, CUDA 12.2

The code uses a custom atomic that increments a global variable in memory as long as it is not above a certain threshold. If the value was increased, it is the index of the next work package for the block.

#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>


#define gpuErrchk(ans)                                                         \
  { gpuAssert((ans), __FILE__, __LINE__); }
  
__inline__ void gpuAssert(cudaError_t code, const char *file, int line) {
  if (code != cudaSuccess) {
    printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    std::cout << std::endl;
  }
}


// a function that increments the device qhead for each block entering
// do not increment over maximum trail size
__device__ bool atomicAddThreshold(unsigned int* address, unsigned int* threshold, unsigned int& qhead) {
  
  __shared__ bool increased;
  increased = false;
  __syncthreads();
  if (threadIdx.x == 0) {
    unsigned int next;
    unsigned int old = *address, assumed;
    do {
      assumed = old;
      next = assumed + 1 > *threshold ? *threshold : assumed + 1;
      old = atomicCAS(address, assumed, next);
    } while (assumed != old);
    qhead = old;
    increased = next > assumed;
  }
  printf("before syncing threads in atomicAddThreshold, %d/%d has qhead %d\n", threadIdx.x, blockIdx.x, qhead);
  __syncthreads();
  return increased;
}

// stripped down version of busy working queue
// as long as there is work to do (qhead < trail_size) get next index to work on
__global__ void foo(unsigned int* qhead, unsigned int* device_trail_size, unsigned int* inactive_blocks) {

  __shared__ bool finished;
  __shared__ bool all_finished;
  __shared__ unsigned int local_qhead;
  finished = false;
  all_finished = false;
  local_qhead = 0;
  __syncthreads();

  while(true) { // busy waiting loop

    if (!finished && atomicAddThreshold(qhead, device_trail_size, local_qhead)) {
        // work() // do some work
        printf("Block/Thread %d %d entering nary propagation with qhead %d\n", blockIdx.x, threadIdx.x, local_qhead);
        if (threadIdx.x == 0) local_qhead++;
    }
  
    __syncthreads();
    
    if (threadIdx.x == 0) {
      if (!finished) {
          finished = true;
          atomicAdd(inactive_blocks, 1);
      }
      if (finished && *inactive_blocks == gridDim.x)
        all_finished = true;
    }
    __syncthreads();
    
    if (all_finished) {
      printf("FINISHED PROPAGATION FOR BLOCK %d and THREAD %d\n", blockIdx.x, threadIdx.x);
      break;
    }
  } 
}


int main()
{

  unsigned int number = 0;
  unsigned int* qhead;
  unsigned int* device_trail_size;
  unsigned int* inactive_blocks;
  gpuErrchk(cudaMalloc((void **)&qhead, sizeof(unsigned int)));
  gpuErrchk(cudaMemcpy(qhead, &number, sizeof(unsigned int), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMalloc((void **)&device_trail_size, sizeof(unsigned int)));
  gpuErrchk(cudaMemcpy(device_trail_size, &number, sizeof(unsigned int), cudaMemcpyHostToDevice));
  gpuErrchk(cudaMalloc((void **)&inactive_blocks, sizeof(unsigned int)));
  gpuErrchk(cudaMemcpy(inactive_blocks, &number, sizeof(unsigned int), cudaMemcpyHostToDevice));

  foo<<<2, 32>>>(qhead, device_trail_size, inactive_blocks);

  gpuErrchk( cudaPeekAtLastError() );
  gpuErrchk( cudaDeviceSynchronize() );
  std::cout << "kernel finished" << std::endl;

}

Compiled using nvcc -g -G -std=c++14 "--generate-code=arch=compute_86,code=[compute_86,sm_86]" -o test test.cu

Any help in understanding why this deadlocks is appreciated.
All threads reach the end of atomicAddThreshold, but only one block finishes (either block 0 or block 1). I do not know where all my other threads get stuck.

Regarding this:

There is no guarantee that the value of *inactive_blocks will ever reflect any updates to the quantity stored in *inactive_blocks. To ensure visibility of updates from other threads/blocks, one approach would be to mark the inactive_blocks pointer as volatile.

Thank you, this actually updates the value correctly.

  1. If I change the definition to volatile unsigned int*, I have to const_cast away the volatile for the atomic functions (see the sketch after this list). This works, but it does not seem right. Shouldn’t the address parameter of atomicXXX be volatile, since e.g. atomicAdd needs to make an actual read of the memory and also an actual write?

  2. I guess using volatile is not the same as calling __threadfence() after each atomic increment, right?
    What would a solution using __threadfence() look like? Would it even be possible to “enforce” a read?
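
To make item 1 concrete, this is the pattern I ended up with (a reduced sketch with a hypothetical kernel name, not my real code):

__global__ void count_inactive(volatile unsigned int* inactive_blocks) {
  if (threadIdx.x == 0)
    // atomicAdd has no overload taking a volatile pointer,
    // so the qualifier has to be cast away here
    atomicAdd(const_cast<unsigned int*>(inactive_blocks), 1);
  // volatile read: every iteration actually loads from memory
  // (this assumes all blocks of the grid are resident at once)
  while (*inactive_blocks < gridDim.x)
    ;
}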

This question comes up from time to time. Yes, you have to cast away the volatile when doing atomics. Atomics don’t need volatile to define their behavior; it is already defined.

I don’t think __threadfence(), by itself, is a solution. You might prove me wrong. The reason I say that is that __threadfence() doesn’t make the same kind of guarantee that volatile does: volatile has to do with visibility, not ordering, and the volatile behavior is what is needed when communicating between threads through global memory (ordering may also be needed). __threadfence() is an ordering guarantee. Roughly, it says that if thread X writes A to a location, then writes B, and calls __threadfence() after each write, then there is no possibility that another thread Y will read B from that location and subsequently read A. It does not say that any other thread is ever guaranteed to read B “eventually”. But we need a guarantee that B will “eventually” be read by another thread, if that other thread is reading that location. There are probably other ways to accomplish this kind of inter-thread communication via global memory, with more substantial refactoring of your code, but __threadfence() by itself, added to your code somewhere, seems insufficient to me.

You may prove me wrong of course.
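
To illustrate the distinction, here is a sketch (my own illustration with hypothetical names, not a drop-in fix for your kernel) of the classic flag/data pattern that needs both properties:

__device__ int data;               // payload written by the producer
__device__ volatile int flag = 0;  // consumer spins on this

__global__ void producer_consumer() {
  if (blockIdx.x == 0 && threadIdx.x == 0) {          // producer
    data = 42;
    __threadfence();  // ordering: data becomes visible no later than flag
    flag = 1;
  } else if (blockIdx.x == 1 && threadIdx.x == 0) {   // consumer
    while (flag == 0)  // volatile read: re-reads memory every iteration
      ;
    __threadfence();   // ordering on the reading side as well
    printf("consumer read data = %d\n", data);  // will print 42
  }
}

The volatile on flag is what provides the “eventually reads B” visibility; the fences provide the ordering. Neither alone is sufficient. (This also assumes both blocks are resident at the same time, which holds for your 2-block launch.)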

I feared so, thank you.
I am considering redesigning my algorithm, and while doing so I ran into a question about what else, other than volatile, ensures that global memory can be read/written by different threads.

Given only 1 block in this example:

if (threadIdx.x == 23)
  *global_x = 5;
__syncthreads();
if (threadIdx.x == 42) {
  int a = *global_x;
}

Does __syncthreads() ensure not only that the global memory write has completed, but also that thread 42 hasn’t cached *global_x from before, so that I can actually be sure that a will be 5?

Yes, __syncthreads() carries two mechanisms:

  1. An execution barrier: no thread may proceed beyond the barrier until all threads reach the barrier.
  2. A “memory” barrier that applies to both the logical global and the logical shared spaces. This ensures “visibility”.

The scope of both of these is the threads in a block. They provide no guarantees from threads in one block to threads in another block.

The combination of those two mechanisms can also provide the ordering that was discussed previously. So working at the threadblock level is generally more flexible than working at the grid level.
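
In other words, your snippet is safe once inside a kernel like this (a minimal sketch, assuming global_x points to global memory and a single block with blockDim.x > 42):

__global__ void single_block_example(int* global_x) {
  if (threadIdx.x == 23)
    *global_x = 5;
  __syncthreads();  // execution barrier + block-scope memory barrier:
                    // the write above is now visible to the whole block
  if (threadIdx.x == 42) {
    int a = *global_x;  // guaranteed to read 5, no volatile needed
    printf("a = %d\n", a);
  }
}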

