Problem with lock using atomicCAS

Would you be kind enough to point out what’s going wrong in this code?

/*

Each member of the warp tries to lock a different memory location. In this code, there are only 32 critical locations. No two threads of the same warp fight for the same location, but threads from different warps fight for the same location, determined by their lane ID.

nvcc -arch sm_35 main.cu

*/

#include <stdio.h>

// Spin until the lock at mutex[id] is acquired
__device__ void lock_me(int* mutex, int id) {
    while (atomicCAS((int*) (mutex + id), 0, 1) != 0);
}

// Release the lock at mutex[id]
__device__ void unlock_me(int* mutex, int id) {
    atomicExch((int*) (mutex + id), 0);
}

__global__ void test_lock(int *a) {
    int laneId = threadIdx.x & 0x1f;

    lock_me(a, laneId);   // lock
    unlock_me(a, laneId); // unlock
}

int main(void) {

    // Array of locks
    int *my_locks;
    // allocate  memory for the locks on  GPU
    cudaMalloc((void**) &my_locks, 32 * sizeof (int));
    //initialize memory	for the locks
    cudaMemset(my_locks, 0, 32 * sizeof (int));

    int num_block = 1;
    int thread_per_block = 32;

    for (num_block = 1; num_block <= 128; num_block++) {

        printf("Launching %d threads\n", (num_block * thread_per_block));

        /*
        cudaEvent_t event_start, event_end;
        cudaEventCreate(&event_start);
        cudaEventCreate(&event_end);
        cudaEventRecord(event_start, 0);
         */

        test_lock<<<num_block, thread_per_block>>>(my_locks);
        cudaDeviceSynchronize();
        /*	
        cudaEventRecord(event_end, 0);
        //synchronize
        cudaEventSynchronize(event_start); //optional
        cudaEventSynchronize(event_end); //wait for the event to be executed!
        cudaEventDestroy(event_start);
        cudaEventDestroy(event_end);
         */
    }

    // free the memory allocated on the GPU
    cudaFree(my_locks);

    return 0;
}

Having multiple threads in the same warp compete for locks, even if they are not the same lock, is a recipe for trouble. This is a consequence of warps executing in lock step and of how the SIMT architecture behaves in the presence of warp-divergent code.

Let’s suppose all threads in block 0 are attempting to acquire their locks, and all threads in block 1 are attempting to acquire theirs (the same set of locks). Let’s also suppose threads 0-15 in block 0 succeed and threads 16-31 in block 1 succeed. The warp in block 0 will spin in the while loop in lock_me, and the warp in block 1 will spin in the same place, each waiting for its other half of the threads to acquire their locks. Neither warp can proceed to the unlock_me code until all 32 of its threads have acquired their locks, and that cannot happen until the other warp releases its locks (which cannot happen until all of its threads have acquired theirs…), so you have deadlock.
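To make the lock-step behavior concrete, here is an annotated sketch of the spin loop for the illustrative 16/16 split above (the annotations describe that particular scenario only; it is not the only way the deadlock can arise):

// Inside lock_me(), for the scenario above:
//   warp of block 0: lanes 0-15 hold locks 0-15, lanes 16-31 spin on locks 16-31
//   warp of block 1: lanes 16-31 hold locks 16-31, lanes 0-15 spin on locks 0-15
// Lanes that already hold their lock cannot run ahead to unlock_me() while their
// sibling lanes are still inside the loop, so each warp keeps re-executing:
while (atomicCAS((int*) (mutex + id), 0, 1) != 0);
// unlock_me() is reached only after all 32 lanes exit the loop, which requires the
// other block's warp to release its locks first, hence the deadlock.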

This is generally dangerous and troublesome programming anyway, so I hesitate to suggest it, but you can break the deadlock by allowing the individual threads to acquire and release their locks independently, like this:

__device__ int lock_me(int* mutex, int id) {
    if (atomicCAS((int*) (mutex + id), 0, 1) == 0)
        return 1;
    return 0;
}

__device__ void unlock_me(int* mutex, int id) {
    atomicExch((int*) (mutex + id), 0);
}

__global__ void test_lock(int *a) {
    int laneId = threadIdx.x & 0x1f;
    int successful = 0;

    while (!successful) {
        if (lock_me(a, laneId)) {   // lock acquired?
            unlock_me(a, laneId);   // then unlock
            successful = 1;
        }
    }
}

Let me be clear: I’m not saying what I have posted is bullet-proof code. I am saying that, for the test case you have posted, it seems to avoid the deadlock case I have outlined.

Note that the above has since been shown to be probably not robust; see the discussion here:

https://devtalk.nvidia.com/default/topic/1037511/cuda-programming-and-performance/problem-of-hash-table-lock-in-cuda/
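If you eventually need to protect real work with a per-lane lock like this, one commonly suggested shape is to keep the entire critical section inside the branch that won the atomicCAS and to issue a __threadfence() before releasing. The sketch below follows that shape; the kernel name, the data array, and the increment are hypothetical placeholders for whatever your critical section actually does, and, per the linked discussion, even this pattern should not be treated as guaranteed robust:

__global__ void test_lock_with_work(int *locks, int *data) {
    int laneId = threadIdx.x & 0x1f;
    int done = 0;

    while (!done) {
        if (atomicCAS((int*) (locks + laneId), 0, 1) == 0) { // lock acquired?
            data[laneId] += 1;                      // critical section (hypothetical work)
            __threadfence();                        // make the update visible before releasing
            atomicExch((int*) (locks + laneId), 0); // release the lock
            done = 1;
        }
    }
}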

Thanks a lot. You just saved me weeks.
