atomicCAS deadlock

shabarovs · May 13, 2023, 3:02pm

Hi

I have an array A initialized to 0, and I want to increment some of its elements by 1. The indices of the elements of A that I want to increment are stored in array B. Some elements need to be incremented several times, so I'm trying to use an array of mutexes, one for each element of A. But when I launch my code, the program hangs in a deadlock. (It only works when I set the threads-per-block size to 1, but that's not what I want.)

I'm stuck with this issue. What I ultimately want to do is draw a continuous brush stroke that overlaps itself using CUDA, so I need to access the same pixels of the canvas image in parallel.

Here is my code:

#include <iostream>
#include <cstring> // memset
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id
    int ind = indices[index]; // indices of target array A to increment    

    if (index < nof_indices) {
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0);
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of an array B
    int indices[6] = { 0,1,2,3,4,1 }; // array B; stores indices of an array A which to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with individual mutexes for each of array A elements

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all mutexes to 0 (unlocked)

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, &canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, &indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, &semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel<<<1, 6>>>(d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(&canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);

    return 0;
}

In this example the resulting array A should look like this: {1, 2, 1, 1, 1, 0}, but I only get it when I run the kernel with dimensions <<<6, 1>>>.

I'm using CUDA 12.1 on a GeForce RTX 3060.

Thank you

Robert_Crovella · May 13, 2023, 3:17pm

When posting code on this forum, please format it properly. As a simple example, edit your post by clicking on the pencil icon underneath it. Then select your code. Then click the </> button at the top of the edit window, then save your changes.

Please do that now.

striker159 · May 13, 2023, 5:46pm

Is there a reason why you do not simply increment by 1 atomically?

__global__ void add_kernel(int* matrix, int* indices, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index < nof_indices) {
        int ind = indices[index];
        atomicAdd(matrix + ind, 1);
    }
}

The deadlock probably arises because the threads in a warp are executed in lock-step. Threads do not exit the loop until the condition is true for all threads in the warp. But this can never be the case since the lock for element 1 will never be available for thread 1 or thread 5.
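For anyone who wants to try the atomicAdd approach end to end, here is a minimal sketch of how the kernel above might be driven from the host. The helper name `increment_on_device`, the block size of 128, and the use of `std::vector` are assumptions made for illustration, not something from the original post, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Kernel from the reply above: one atomicAdd per index, so no mutex array is needed.
__global__ void add_kernel(int* matrix, const int* indices, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nof_indices) {
        // Read the target index only after the bounds check.
        atomicAdd(matrix + indices[index], 1);
    }
}

// Hypothetical host helper (an assumption, not from the thread):
// returns a zero-initialized canvas of canvas_size elements in which
// canvas[i] has been incremented once for every occurrence of i in indices.
// CUDA error checking is omitted to keep the sketch short.
std::vector<int> increment_on_device(const std::vector<int>& indices, int canvas_size)
{
    std::vector<int> canvas(canvas_size, 0);
    const int n = static_cast<int>(indices.size());
    if (n == 0 || canvas_size == 0) {
        return canvas;
    }

    int* d_canvas = nullptr;
    int* d_indices = nullptr;
    cudaMalloc(&d_canvas, canvas_size * sizeof(int));
    cudaMalloc(&d_indices, n * sizeof(int));

    cudaMemset(d_canvas, 0, canvas_size * sizeof(int));
    cudaMemcpy(d_indices, indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Round the grid size up so that every index gets a thread.
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(d_canvas, d_indices, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(canvas.data(), d_canvas, canvas_size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_canvas);
    cudaFree(d_indices);
    return canvas;
}
```

With the data from the question, `increment_on_device({0, 1, 2, 3, 4, 1}, 10)` should give a canvas whose first six elements are 1, 2, 1, 1, 1, 0. Because each increment is a single hardware atomic, the result does not depend on how the threads are grouped into warps or blocks.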

Robert_Crovella · May 13, 2023, 6:04pm

I think it should not deadlock if you compile for the architecture you are running on, e.g. -arch=sm_86
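Some background that may help here: GPUs of compute capability 7.0 (Volta) and later have independent thread scheduling, which lets diverged threads within a warp make forward progress, and the RTX 3060 is compute capability 8.6. As far as I know, nvcc targets an older architecture when no -arch option is given, so the code may not get that scheduling behaviour by default. If you are not sure which value to pass, the sketch below queries it at runtime. The helper names `arch_flag` and `arch_flag_for_device` are made up for illustration.

```cuda
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: formats a compute capability as an nvcc -arch option,
// e.g. major 8, minor 6 becomes "-arch=sm_86".
std::string arch_flag(int major, int minor)
{
    return "-arch=sm_" + std::to_string(major) + std::to_string(minor);
}

// Hypothetical helper: looks up the compute capability of the given device
// and returns the matching -arch option, or an empty string if the query fails.
std::string arch_flag_for_device(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return "";
    }
    return arch_flag(prop.major, prop.minor);
}
```

Calling `arch_flag_for_device(0)` on the RTX 3060 from the question should return the same -arch=sm_86 value suggested above.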

Robert_Crovella · May 13, 2023, 6:23pm

shabarovs · May 13, 2023, 6:53pm

Thanks Robert_Crovella, it works now!

shabarovs · May 13, 2023, 7:12pm

works like a charm now
