Implementing a global lock (critical section) in a single-block CUDA kernel

I'm running CUDA on Ubuntu 22.04.

Device, driver, and CUDA version info:

NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5  
NVIDIA GeForce GTX 1650

I am developing an Interaction Nets emulator with C and CUDA, and I am trying to perform parallel reductions on integer arrays.

This is my reduce kernel:

__global__ void reduce_kernel(int *cell_conns, int *cell_types, int *conn_rules, int *cell_count, int **arr_cell, int **arr_ports, int *lock) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx > MAX_CELLS) return;

  int conn = cell_conns[idx];
  int rule = conn_rules[idx];

  if (conn == -1) return;

  printf("Im thread %i and got here!\n", idx);

  // get the main ports
  int *a_connected_cell = arr_cell[idx];
  int *a_connected_port = arr_ports[idx];
  int a_type = cell_types[idx];

  int *b_connected_cell = arr_cell[conn];
  int *b_connected_port = arr_ports[conn];
  int b_type = cell_types[conn];

  if (a_connected_cell == NULL || a_connected_port == NULL || b_connected_cell == NULL || b_connected_port == NULL || cell_types[idx] == -1 || cell_types[conn] == -1) {
    return;
  }

  printf("Thread %i waiting for lock\n", idx);
  while (atomicCAS(lock, 0, 1) != 0) {
  };
  printf("Thread %i acquired lock!\n", idx);

  __threadfence();

  if (rule == SUC_SUM) {
    printf("SUC SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == SUC) {
      suc_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      suc_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of suc sum!");
  } else if (rule == ZERO_SUM) {
    printf("ZERO SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == ZERO) {
      zero_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      zero_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of zero sum!");
  }

  __threadfence();
  atomicExch(lock, 0);
  printf("Thread %i released lock.\n", idx);
  __threadfence();
}

The printf calls are there for debugging. The functions zero_sum_c and suc_sum_c contain no loops and simply modify the arr_cell and arr_ports values; no locking is done inside them. The full implementation lives in inets/cuda/inets.cu on the main branch of the Lorenzobattistela/inets repository on GitHub.

The output I get is:

Im thread 10 and got here!
Im thread 80 and got here!
Im thread 33 and got here!
Im thread 57 and got here!
Thread 10 waiting for lock
Thread 80 waiting for lock
Thread 33 waiting for lock
Thread 57 waiting for lock
Thread 10 acquired lock!
SUC SUM with 10 and 22
got out of suc sum!Thread 10 released lock.
Thread 80 acquired lock!
SUC SUM with 80 and 92
got out of suc sum!Thread 80 released lock.

And after thread 80 releases the lock, the kernel deadlocks.

How and why is this happening? This is my first CUDA implementation and my first contact with it, so if you need any additional info, feel free to ask.

I don't understand how it deadlocks, since I have only one block of 1024 threads. Only 4 threads get that far, because there are only 4 reductions to be performed, and those 4 threads are fighting for the global lock.

The global lock is prepared for the kernel like this:

  int *d_lock;
  cudaMalloc(&d_lock, sizeof(int));

  cudaMemset(d_lock, 0, sizeof(int));
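
The kernel itself is launched with a single block of 1024 threads, roughly like this (the d_* pointer names here are abbreviations rather than the exact identifiers in my code):

  // one block of 1024 threads
  reduce_kernel<<<1, 1024>>>(d_cell_conns, d_cell_types, d_conn_rules,
                             d_cell_count, d_arr_cell, d_arr_ports, d_lock);
  cudaDeviceSynchronize();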

Thanks in advance

Locks and critical sections are discussed in a number of forum posts. Here are a few examples: 1 2

My guess would be that the odd behavior arises because threads 33 and 57 are part of the same warp; that statement cannot be applied to any other pair among the 4 threads you have listed. If you want to see an example of how this could matter, take a look at the 2nd link I provided. I’m not saying that is an exact description of your case (it obviously isn’t), but it is an example of how the warp association, coupled with the exact code generated by the compiler and other factors, could give rise to the deadlock. In my experience, each of these cases tends to be fairly specific and requires SASS analysis for a full root-cause understanding.
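
As a rough sketch of one commonly suggested restructuring (an illustration of the pattern, not a drop-in fix for your kernel): put the critical section inside the loop that attempts the acquisition, so the thread that wins the atomicCAS acquires, does its work, and releases all on the same control-flow path, instead of spinning in one loop and releasing in code the compiler may place after a warp reconvergence point.

__device__ int g_lock = 0;   // 0 = free, 1 = held

__global__ void critical_section_sketch()
{
  bool done = false;
  while (!done) {
    if (atomicCAS(&g_lock, 0, 1) == 0) {  // try to acquire the lock
      // ... critical section: update the shared data here ...
      __threadfence();                    // make the writes visible to other threads
      atomicExch(&g_lock, 0);             // release the lock
      done = true;                        // leave the loop on the same path
    }
  }
}

The structural difference from your kernel is that acquire, critical section, and release all sit in one branch of one loop iteration; there is no separate spin-wait loop in which the other threads of the warp can spin while the lock holder waits to be scheduled.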

In these situations, if you want further understanding, a complete test case is needed (and I personally don’t wish to assemble the code for that by poking around; nothing prevents you from providing a short, complete example here). In addition, we need the GPU you are running on, the CUDA version, and the OS (which you have indicated), as well as the exact compile command.

For example it wouldn’t surprise me if you saw different behavior in your test case by compiling explicitly for your device (e.g. -arch=sm_75) vs. not (i.e. providing no arch specification).
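
For example, something like:

nvcc -arch=sm_75 -o inets inets.cu

versus the same command with no -arch specification (which builds for the toolkit's default architecture).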

Personally, rather than trying to roll my own lock implementation, I would use the semaphore method that is linked from the first link I provided.
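
For illustration only (this sketch is mine, not necessarily the exact code behind that link): libcu++, which ships with the CUDA toolkit, provides cuda::binary_semaphore, and on a cc 7.5 device like yours it can stand in for a hand-rolled lock:

#include <cuda/semaphore>
#include <cstdio>

// device-scope binary semaphore, initially available (count of 1)
__device__ cuda::binary_semaphore<cuda::thread_scope_device> sem(1);

__global__ void semaphore_sketch()
{
  sem.acquire();                          // blocks until the semaphore is free
  // ... critical section ...
  printf("thread %d in the critical section\n", threadIdx.x);
  sem.release();
}

Compile it for the architecture you are running on (e.g. -arch=sm_75), which brings me to the next point.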

I would generally suggest compiling for the device you intend to run on. I’m not saying exactly how that impacts your test case; I’m just offering it as a general suggestion. You may be doing that already.