A problem implementing a mutex in CUDA

Hello! I am working on something that involves implementing a mutex in CUDA. I followed the method from this post: synchronization - Implementing a critical section in CUDA - Stack Overflow
The only difference is that I have to put the mutex inside a for loop.
The original author said the following code works:

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {  // acquired the lock (previous value was 0)
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]), 0u);          // release the lock
        }
    } 
}

I put the code block inside a for loop, so my code looks like this:

__global__ void k_testLocking(unsigned int* locks, int n, unsigned int* array) {
    int id = threadIdx.x % n;
    for (int i = array[id]; i < array[id + 1]; i++) {
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[i]), 1u) == 0u) {  // acquired lock i
                //critical section
                leaveLoop = true;
                atomicExch(&(locks[i]), 0u);          // release lock i
            }
        }
    }
}

But my code does not work; it seems to hit a deadlock. Can anyone tell me what's wrong with it, please?

assuming the original code works as intended…

did you preinitialize the global memory pointed to by locks with cudaMemset to all zero?
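
If not, here is a minimal host-side sketch of what I mean (d_locks and n_locks are placeholder names and sizes, not taken from your code):

unsigned int* d_locks;
int n_locks = 256;  // placeholder size
cudaMalloc(&d_locks, n_locks * sizeof(unsigned int));
cudaMemset(d_locks, 0, n_locks * sizeof(unsigned int));  // all locks start out free (0)
k_testLocking<<<1, 256>>>(d_locks, n_locks);             // placeholder launch configuration
cudaDeviceSynchronize();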

Christian

Yes, I did. If I run the original code it works fine, but when I put it into the for loop it seems to deadlock. I can't figure out where the problem is.

It might have something to do with the placement of reconvergence points by the compiler.
Since the compiler can't predict the contents of array, it might place the reconvergence point only after the outer loop.
I haven't checked whether this could lead to a situation where the inner loop never terminates.
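
Purely as an experiment (a sketch, not a verified fix): read the loop bounds into registers up front, and add a __threadfence() before releasing the lock so each critical section's writes are visible to the next holder. Whether this changes where the compiler places the reconvergence point, I haven't checked:

__global__ void k_testLocking(unsigned int* locks, int n, unsigned int* array) {
    int id = threadIdx.x % n;
    int begin = array[id];       // read the loop bounds once into registers
    int end   = array[id + 1];
    for (int i = begin; i < end; i++) {
        bool leaveLoop = false;
        while (!leaveLoop) {
            if (atomicExch(&(locks[i]), 1u) == 0u) {  // acquired lock i
                //critical section
                __threadfence();              // flush critical-section writes before releasing
                leaveLoop = true;
                atomicExch(&(locks[i]), 0u);  // release lock i
            }
        }
    }
}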

What I would like to say is a little bit off topic.
The essence of this lock implementation is that it forces all the threads in a warp to run serially, which leads to extremely low hardware utilization. I'm still trying to find a more efficient way to implement a lock.

Try to find a way to design your algorithm to not need a lock.

GPU-based parallelism differs from CPU-based parallelism in that any form of locking is incredibly punitive and should be avoided like the plague.

Write lock-free code or bust! I've never really seen examples of well-performing, or even robust/working, mutex locks within CUDA/OpenCL.

Synchronize at block level or kernel level if possible.
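
As one illustration (a hypothetical sketch, not the poster's actual workload): if the critical section only does a simple per-element accumulation, a single atomicAdd replaces the mutex entirely:

__global__ void k_accumulate(unsigned int* counts, const unsigned int* array, int n) {
    int id = threadIdx.x % n;
    for (unsigned int i = array[id]; i < array[id + 1]; i++) {
        atomicAdd(&counts[i], 1u);  // lock-free read-modify-write, no mutex needed
    }
}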