Implementing a global lock (critical section) in a single-block CUDA kernel

I'm running CUDA on Ubuntu 22.04.

Device, driver, and CUDA version info:

NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5  
NVIDIA GeForce GTX 1650

I am developing an Interaction Nets emulator with C and CUDA, and I am trying to perform parallel reductions on integer arrays.

This is my reduce kernel:

__global__ void reduce_kernel(int *cell_conns, int *cell_types, int *conn_rules, int *cell_count, int **arr_cell, int **arr_ports, int *lock) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx > MAX_CELLS) return;

  int conn = cell_conns[idx];
  int rule = conn_rules[idx];

  if (conn == -1) return;

  printf("Im thread %i and got here!\n", idx);

  // get the main ports
  int *a_connected_cell = arr_cell[idx];
  int *a_connected_port = arr_ports[idx];
  int a_type = cell_types[idx];

  int *b_connected_cell = arr_cell[conn];
  int *b_connected_port = arr_ports[conn];
  int b_type = cell_types[conn];

  if (a_connected_cell == NULL || a_connected_port == NULL || b_connected_cell == NULL || b_connected_port == NULL || cell_types[idx] == -1 || cell_types[conn] == -1) {
    return;
  }

  printf("Thread %i waiting for lock\n", idx);
  while (atomicCAS(lock, 0, 1) != 0) {
  };
  printf("Thread %i acquired lock!\n", idx);

  __threadfence();

  if (rule == SUC_SUM) {
    printf("SUC SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == SUC) {
      suc_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      suc_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of suc sum!");
  } else if (rule == ZERO_SUM) {
    printf("ZERO SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == ZERO) {
      zero_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      zero_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of zero sum!");
  }

  __threadfence();
  atomicExch(lock, 0);
  printf("Thread %i released lock.\n", idx);
  __threadfence();
}

The printf calls are there for debugging. The functions zero_sum_c and suc_sum_c contain no loops and simply modify the arr_cell and arr_ports values; no locking is done inside them. The full implementation lives in inets/cuda/inets.cu on the main branch of the Lorenzobattistela/inets repository on GitHub.

The output I get is:

Im thread 10 and got here!
Im thread 80 and got here!
Im thread 33 and got here!
Im thread 57 and got here!
Thread 10 waiting for lock
Thread 80 waiting for lock
Thread 33 waiting for lock
Thread 57 waiting for lock
Thread 10 acquired lock!
SUC SUM with 10 and 22
got out of suc sum!Thread 10 released lock.
Thread 80 acquired lock!
SUC SUM with 80 and 92
got out of suc sum!Thread 80 released lock.

And after thread 80 releases the lock, the kernel deadlocks.

How and why is this happening? This is my first CUDA implementation and my first contact with it, so if you need any additional info, feel free to ask.

I don't understand how it deadlocks, since I have only one block of 1024 threads. Only 4 threads get that far, because there are only 4 reductions to be performed, and those 4 threads are fighting for the global lock.

The global lock is prepared for the kernel like this:

  int *d_lock;
  cudaMalloc(&d_lock, sizeof(int));

  cudaMemset(d_lock, 0, sizeof(int));
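
The kernel itself is launched with a single block of 1024 threads, roughly like this (the d_* pointer names here are abbreviations rather than the exact identifiers in my code):

  // one block of 1024 threads
  reduce_kernel<<<1, 1024>>>(d_cell_conns, d_cell_types, d_conn_rules,
                             d_cell_count, d_arr_cell, d_arr_ports, d_lock);
  cudaDeviceSynchronize();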

Thanks in advance

Locks and critical sections are discussed in a number of forum posts. Here are a few examples: 1 2

My guess would be that the odd behavior arises because threads 33 and 57 are part of the same warp; that statement cannot be applied to any other pair among the 4 threads you have listed. If you want to see an example of how this could matter, take a look at the 2nd link I provided. I’m not saying that is an exact description of your case (it obviously isn’t), but it is an example of how the warp association, coupled with the exact code generated by the compiler and other factors, could give rise to the deadlock. In my experience, each of these cases tends to be fairly specific and requires SASS analysis for a full root-cause understanding.
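
As a rough sketch of one commonly suggested restructuring (an illustration of the pattern, not a drop-in fix for your kernel): put the critical section inside the loop that attempts the acquisition, so the thread that wins the atomicCAS acquires, does its work, and releases all on the same control-flow path, instead of spinning in one loop and releasing in code the compiler may place after a warp reconvergence point.

__device__ int g_lock = 0;   // 0 = free, 1 = held

__global__ void critical_section_sketch()
{
  bool done = false;
  while (!done) {
    if (atomicCAS(&g_lock, 0, 1) == 0) {  // try to acquire the lock
      // ... critical section: update the shared data here ...
      __threadfence();                    // make the writes visible to other threads
      atomicExch(&g_lock, 0);             // release the lock
      done = true;                        // leave the loop on the same path
    }
  }
}

The structural difference from your kernel is that acquire, critical section, and release all sit in one branch of one loop iteration; there is no separate spin-wait loop in which the other threads of the warp can spin while the lock holder waits to be scheduled.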

In these situations, if you want further understanding, a complete test case is needed (and I personally don’t wish to assemble the code for that by poking around; nothing prevents you from providing a short, complete example here). In addition, we need the GPU you are running on, the CUDA version, and the OS (which you have indicated), as well as the exact compile command.

For example it wouldn’t surprise me if you saw different behavior in your test case by compiling explicitly for your device (e.g. -arch=sm_75) vs. not (i.e. providing no arch specification).
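
For example, something like:

nvcc -arch=sm_75 -o inets inets.cu

versus the same command with no -arch specification (which builds for the toolkit's default architecture).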

Personally, rather than trying to roll my own lock implementation, I would use the semaphore method that is linked from the first link I provided.
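
For illustration only (this sketch is mine, not necessarily the exact code behind that link): libcu++, which ships with the CUDA toolkit, provides cuda::binary_semaphore, and on a cc 7.5 device like yours it can stand in for a hand-rolled lock:

#include <cuda/semaphore>
#include <cstdio>

// device-scope binary semaphore, initially available (count of 1)
__device__ cuda::binary_semaphore<cuda::thread_scope_device> sem(1);

__global__ void semaphore_sketch()
{
  sem.acquire();                          // blocks until the semaphore is free
  // ... critical section ...
  printf("thread %d in the critical section\n", threadIdx.x);
  sem.release();
}

Compile it for the architecture you are running on (e.g. -arch=sm_75), which brings me to the next point.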

I would generally suggest compiling for the device you intend to run on. I’m not saying exactly how that impacts your test case; I’m just offering it as a general suggestion. You may be doing that already.