Implementing mutual exclusion lock using atomicCAS()

rajesh · August 4, 2009, 1:56pm

I am trying to implement basic concurrent queue either in shared or device memory. I have implemented blocking enqueue() and dequeue() operations using a single mutual exclusion lock on the queue.
I am using CUDA 2.2 on GTX 285.

shared int lock;

global void kernel(){

// Assume a one-dimensional thread mapping
// First thread in every block sets the lock value to 0
if (threadIdx.x == 0){
lock =0;
}
}

void enqueue(){
while (atomicCAS(&lock, 0, 1) == 1);
// insert an item

// release the lock
lock=0;
threadfence_block();

// should I use atomicExch() instead?
}

void dequeue(){
while (atomicCAS(&lock, 0, 1) == 1);
// remove an item

// release the lock
lock=0;
threadfence_block();

// should I use atomicExch() instead?
}

The kernel seems to hang and then returns with error. Can someone tell me what may be wrong?

Thanks!
Rajesh

rajesh · August 4, 2009, 2:59pm

Sorry I found the mistake…

the while loop should be

while (atomicCAS(&lock, 0, 1) == 0);

Sarnath · August 5, 2009, 5:28am

Using such spinning “while” loops cause warp-divergence… Since it is NOT documented which sub-warp will be scheduled, it usually causes deadlock… The spinning sub-warp is schedueld agian and again and the winning thread never gets control…

It is dangerous to use such loops.

Paste the following search keyword in google “atomicCAS Sarnath Tmurray site:http://forums.nvidia.com”
and choose the first link (I am feeling lucky)

Also, Please follow Tmurray’s advice on how to proceed in a granualar fashion while obtaining the lock. i.e. 1 thread per block (usually threadIdx.x == 0) participates in a block-level lock contention. One block gets control while others spin… The threads among winning block will fight for a shared memory lock and ultimately one thread will win… If you allow all threads to compete, your atomic functions will make your kernel many times slower than CPU.