A problem implementing a mutex in CUDA

Hello! I am working on something that involves implementing a mutex in CUDA. I followed the method from this post: synchronization - Implementing a critical section in CUDA - Stack Overflow
The only difference is that I have to put the mutex inside a for loop.
The original author said the following code works:

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {  // acquired the lock (previous value was 0)
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]), 0u);          // release the lock
        }
    } 
}

I put the code block inside a for loop, so my code looks like this:

__global__ void k_testLocking(unsigned int* locks, int n, unsigned int* array) {
    int id = threadIdx.x % n;
    for (int i = array[id]; i < array[id + 1]; i++) {
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[i]), 1u) == 0u) {  // acquired lock i
                //critical section
                leaveLoop = true;
                atomicExch(&(locks[i]), 0u);          // release lock i
            }
        }
    }
}

But my code does not work; it seems to hit a deadlock. Can anyone tell me what's wrong with it, please?

assuming the original code works as intended…

did you preinitialize the global memory pointed to by locks with cudaMemset to all zero?
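
If not, here is a minimal host-side sketch of what I mean (d_locks and n_locks are placeholder names and sizes, not taken from your code):

unsigned int* d_locks;
int n_locks = 256;  // placeholder size
cudaMalloc(&d_locks, n_locks * sizeof(unsigned int));
cudaMemset(d_locks, 0, n_locks * sizeof(unsigned int));  // all locks start out free (0)
k_testLocking<<<1, 256>>>(d_locks, n_locks);             // placeholder launch configuration
cudaDeviceSynchronize();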

Christian

Yes, I did. If I run the original code it works fine, but when I put it into the for loop it seems to deadlock. I can't figure out where the problem is.

It might have something to do with the placement of reconvergence points by the compiler.
Since the compiler can't predict the contents of array, it might place the reconvergence point only after the outer loop.
I haven't checked whether this could lead to a situation where the inner loop never terminates.
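
Purely as an experiment (a sketch, not a verified fix): read the loop bounds into registers up front, and add a __threadfence() before releasing the lock so each critical section's writes are visible to the next holder. Whether this changes where the compiler places the reconvergence point, I haven't checked:

__global__ void k_testLocking(unsigned int* locks, int n, unsigned int* array) {
    int id = threadIdx.x % n;
    int begin = array[id];       // read the loop bounds once into registers
    int end   = array[id + 1];
    for (int i = begin; i < end; i++) {
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[i]), 1u) == 0u) {  // acquired lock i
                //critical section
                __threadfence();              // flush critical-section writes before releasing
                leaveLoop = true;
                atomicExch(&(locks[i]), 0u);  // release lock i
            }
        }
    }
}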

What I would like to say is a little bit off topic.
The essence of this lock implementation is that it forces all the threads in a warp to run serially, which leads to extremely low hardware utilization. I'm still trying to find a more efficient way to implement a lock.

Try to find a way to design your algorithm to not need a lock.

GPU-based parallelism differs from CPU-based parallelism in that any form of locking is incredibly punitive and should be avoided like the plague.

Write lock-free code or bust! I've never really seen examples of well-performing, or even robust/working, mutex locks within CUDA/OpenCL.

Synchronize at block level or kernel level if possible.
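
As one illustration (a hypothetical sketch, not the poster's actual workload): if the critical section only does a simple per-element accumulation, a single atomicAdd replaces the mutex entirely:

__global__ void k_accumulate(unsigned int* counts, const unsigned int* array, int n) {
    int id = threadIdx.x % n;
    for (unsigned int i = array[id]; i < array[id + 1]; i++) {
        atomicAdd(&counts[i], 1u);  // lock-free read-modify-write, no mutex needed
    }
}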