atomicCAS deadlock


I have an array A whose values are all 0, and I want to increment some of its elements by 1. The indices of the elements to increment are stored in array B. Some elements need to be incremented several times, so I'm trying to use an array of mutexes, one per element of A. But when I launch my code, the program hangs in a deadlock. (It only works when I set the threads-per-block size to 1, which is not what I want.)

I'm stuck on this issue. What I ultimately want to do is draw a continuous brush stroke that overlaps itself using CUDA, so I need to access the same pixels of the canvas image in parallel.

Here is my code:

#include <cstring>
#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id

    if (index < nof_indices) {
        int ind = indices[index]; // index of the element of array A to increment
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0); // spin until the lock is acquired
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0); // release the lock
    }
}

int main()
{
    int nof_indices = 6; // length of array B
    int indices[6] = { 0, 1, 2, 3, 4, 1 }; // array B; stores the indices of array A to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with one mutex per element of array A

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all elements of array A to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all mutexes to 0 (unlocked)

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel<<<1, 6>>>(d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    return 0;
}

In this example the first six elements of the resulting array A should be {1, 2, 1, 1, 1, 0}, but I only get that when I launch the kernel with dimensions <<<6, 1>>>.

I'm using CUDA 12.1 on a GeForce RTX 3060.

Thank you

When posting code on this forum, please format it properly. As a simple example, edit your post by clicking on the pencil icon underneath it. Then select your code. Then click the </> button at the top of the edit window, then save your changes.

Please do that now.

Is there a reason why you do not simply increment by 1 atomically?

__global__ void add_kernel(int* matrix, int* indices, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nof_indices) {
        int ind = indices[index];
        atomicAdd(matrix + ind, 1); // one hardware atomic, no lock needed
    }
}
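
For anyone who wants to try the atomicAdd approach end to end, here is a minimal sketch of a host wrapper around such a kernel. The names add_kernel_atomic and increment_with_atomic_add are made up for illustration and are not from the original post. The sketch assumes every index is inside the canvas and leaves out error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread atomically increments the canvas element named by its entry in indices.
__global__ void add_kernel_atomic(int* matrix, const int* indices, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nof_indices) {
        atomicAdd(&matrix[indices[index]], 1);
    }
}

// Hypothetical host helper: starts from a zeroed canvas of canvas_size elements,
// applies one increment per entry of indices on the GPU, and returns the result.
// Assumes every index lies in [0, canvas_size). CUDA error checking is omitted.
std::vector<int> increment_with_atomic_add(const std::vector<int>& indices, int canvas_size)
{
    std::vector<int> canvas(canvas_size, 0);
    int n = static_cast<int>(indices.size());
    if (n == 0 || canvas_size == 0) return canvas;

    int* d_canvas = nullptr;
    int* d_indices = nullptr;
    cudaMalloc(&d_canvas, canvas_size * sizeof(int));
    cudaMalloc(&d_indices, n * sizeof(int));
    cudaMemset(d_canvas, 0, canvas_size * sizeof(int));
    cudaMemcpy(d_indices, indices.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Enough blocks to cover all indices; the kernel's bounds check handles the tail.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add_kernel_atomic<<<blocks, threads>>>(d_canvas, d_indices, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(canvas.data(), d_canvas, canvas_size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_canvas);
    cudaFree(d_indices);
    return canvas;
}
```

Calling increment_with_atomic_add({0, 1, 2, 3, 4, 1}, 10) should give a canvas that starts with 1, 2, 1, 1, 1, 0, regardless of the block size, because colliding increments are serialized by the hardware atomic rather than by a lock.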

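The deadlock probably arises because the threads in a warp execute in lock-step. No thread leaves the spin loop until the condition is true for all threads in the warp. That can never happen here: threads 1 and 5 both target element 1, so whichever of them takes the lock is held at the loop with the rest of the warp and never reaches the atomicExch that would release it, while the other keeps spinning on a lock that is never freed.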
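
If a per-element lock really is needed (for example because the critical section does more than one addition), a commonly suggested restructuring is to put the critical section and the release inside the branch that won the lock, so the winner does not have to leave the loop before it can unlock. The following is only a sketch under that assumption. The name add_kernel_locked is made up, and on GPUs older than Volta the behaviour still depends on how the compiler lays out the branches, so it is not a guarantee.

```cuda
#include <cuda_runtime.h>

// Sketch of a spin lock that keeps acquire, critical section, and release
// in the same branch. locks must be zero-initialized (0 = unlocked).
__global__ void add_kernel_locked(int* matrix, const int* indices, int* locks, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= nof_indices) return;

    int ind = indices[index];
    volatile int* cell = matrix + ind;  // volatile: read and write memory, not a cached register copy
    bool done = false;

    while (!done) {
        if (atomicCAS(&locks[ind], 0, 1) == 0) {  // lock acquired by this thread
            *cell = *cell + 1;                    // critical section
            __threadfence();                      // make the write visible before unlocking
            atomicExch(&locks[ind], 0);           // release the lock
            done = true;
        }
        // Threads that lost the race fall through and try again.
    }
}
```

For a plain increment, atomicAdd remains the simpler and faster choice. A lock like this only makes sense when several dependent reads and writes have to happen together.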

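I think it should not deadlock if you compile for the architecture you are running on, e.g. -arch=sm_86. Since Volta (sm_70), the threads of a warp are scheduled independently, so the thread holding the lock can make progress while the other one spins.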
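
If you are not sure which value to pass to -arch, you can ask the runtime for the compute capability of your GPU. Here is a small sketch. The helper names arch_name and arch_flag_for_device are made up for illustration, and the file name in the compile command is only an example.

```cuda
#include <cuda_runtime.h>
#include <string>

// Hypothetical helper: formats a compute capability as an nvcc architecture
// name, e.g. major 8 and minor 6 become "sm_86".
std::string arch_name(int major, int minor)
{
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

// Hypothetical helper: returns the architecture name for the given device,
// or an empty string if the device properties cannot be queried.
std::string arch_flag_for_device(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return "";
    }
    return arch_name(prop.major, prop.minor);
}

// Example compile command for a compute capability 8.6 GPU
// (add.cu is an assumed file name):
//   nvcc -arch=sm_86 add.cu -o add
```

The RTX 3060 mentioned above has compute capability 8.6, which is where sm_86 comes from.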

Thanks Robert_Crovella, it works now!

Works like a charm now.