I'm running CUDA on Ubuntu 22.04. Device, driver, and CUDA version info:

NVIDIA GeForce GTX 1650
NVIDIA-SMI 555.42.02, Driver Version: 555.42.02, CUDA Version: 12.5
I am developing an Interaction Nets emulator in C and CUDA, and I am trying to perform parallel reductions on integer arrays.
This is my reduce kernel:
__global__ void reduce_kernel(int *cell_conns, int *cell_types, int *conn_rules, int *cell_count, int **arr_cell, int **arr_ports, int *lock) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= MAX_CELLS) return;  // guard out-of-range threads

  int conn = cell_conns[idx];
  int rule = conn_rules[idx];
  if (conn == -1) return;  // no active pair for this cell

  printf("Im thread %i and got here!\n", idx);

  // get the main ports
  int *a_connected_cell = arr_cell[idx];
  int *a_connected_port = arr_ports[idx];
  int a_type = cell_types[idx];

  int *b_connected_cell = arr_cell[conn];
  int *b_connected_port = arr_ports[conn];
  int b_type = cell_types[conn];

  if (a_connected_cell == NULL || a_connected_port == NULL ||
      b_connected_cell == NULL || b_connected_port == NULL ||
      cell_types[idx] == -1 || cell_types[conn] == -1) {
    return;
  }

  printf("Thread %i waiting for lock\n", idx);
  while (atomicCAS(lock, 0, 1) != 0) {
    // spin until the global lock is free
  }
  printf("Thread %i acquired lock!\n", idx);
  __threadfence();

  if (rule == SUC_SUM) {
    printf("SUC SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == SUC) {
      suc_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      suc_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of suc sum!");
  } else if (rule == ZERO_SUM) {
    printf("ZERO SUM with %i and %i\n", idx, conn);
    if (a_type == SUM && b_type == ZERO) {
      zero_sum_c(arr_cell, arr_ports, cell_types, conn, idx, cell_count, lock);
    } else {
      zero_sum_c(arr_cell, arr_ports, cell_types, idx, conn, cell_count, lock);
    }
    printf("got out of zero sum!");
  }

  __threadfence();
  atomicExch(lock, 0);  // release the lock
  printf("Thread %i released lock.\n", idx);
  __threadfence();
}
The prints are there for debugging. The functions zero_sum_c and suc_sum_c contain no loops; they simply overwrite entries of arr_cell and arr_ports, and they do no locking of their own. The full implementation lives here: https://github.com/Lorenzobattistela/inets/blob/main/cuda/inets.cu
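To give an idea of their shape without pasting the whole file, here is a rough sketch of what such a rewrite function looks like. This is simplified and made up for illustration; the exact port layout and field accesses are in the repo:

// Simplified illustration only -- not the exact code from the repo.
// A rewrite function just reassigns connection/port entries for the two
// cells of an active pair; there are no loops and no locking inside.
__device__ void suc_sum_sketch(int **arr_cell, int **arr_ports, int *cell_types, int suc_idx, int sum_idx, int *cell_count, int *lock) {
  // rewire: the SUM cell's main port takes over what the SUC cell pointed to
  arr_cell[sum_idx][0] = arr_cell[suc_idx][1];
  arr_ports[sum_idx][0] = arr_ports[suc_idx][1];
  // retire the SUC cell
  cell_types[suc_idx] = -1;
  arr_cell[suc_idx] = NULL;
  arr_ports[suc_idx] = NULL;
}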
The output I get is:
Im thread 10 and got here!
Im thread 80 and got here!
Im thread 33 and got here!
Im thread 57 and got here!
Thread 10 waiting for lock
Thread 80 waiting for lock
Thread 33 waiting for lock
Thread 57 waiting for lock
Thread 10 acquired lock!
SUC SUM with 10 and 22
got out of suc sum!Thread 10 released lock.
Thread 80 acquired lock!
SUC SUM with 80 and 92
got out of suc sum!Thread 80 released lock.
And after thread 80 releases the lock, the kernel deadlocks.

How and why is this happening? This is my first CUDA implementation and my first contact with it, so if you need any additional info, feel free to ask.

I don't get how it can deadlock, since I have only one block of 1024 threads. Only 4 threads get that far, because there are only 4 reductions to perform, and those 4 threads are fighting over the global lock. For what it's worth, threads 33 and 57 land in the same warp (warp 1, since the warp index is idx / 32), while threads 10 and 80 are in warps 0 and 2.
The global lock is prepared on the host before the kernel launch:
int *d_lock;
cudaMalloc(&d_lock, sizeof(int));
cudaMemset(d_lock, 0, sizeof(int));
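In case it helps, here is a minimal, self-contained sketch of just the locking pattern I am using, with the interaction-net logic stubbed out. It launches 1 block of 1024 threads and lets only 4 of them contend, mirroring my run; the work inside the critical section is a placeholder. I can check whether the hang reproduces with this:

#include <cstdio>
#include <cuda_runtime.h>

// Same atomicCAS spinlock as in reduce_kernel, isolated from the
// interaction-net logic. Only threads 10, 33, 57 and 80 contend,
// mirroring the four reductions in my run.
__global__ void lock_test_kernel(int *lock) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx != 10 && idx != 33 && idx != 57 && idx != 80) return;

  printf("Thread %i waiting for lock\n", idx);
  while (atomicCAS(lock, 0, 1) != 0) {
    // spin until the global lock is free
  }
  printf("Thread %i acquired lock!\n", idx);
  __threadfence();

  // placeholder for the critical section (suc_sum_c / zero_sum_c)

  __threadfence();
  atomicExch(lock, 0);  // release the lock
  printf("Thread %i released lock.\n", idx);
}

int main() {
  int *d_lock;
  cudaMalloc(&d_lock, sizeof(int));
  cudaMemset(d_lock, 0, sizeof(int));
  lock_test_kernel<<<1, 1024>>>(d_lock);
  cudaDeviceSynchronize();
  cudaFree(d_lock);
  return 0;
}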
Thanks in advance