Confusing results while using atomicCAS() on shared variables

I am trying to write a concurrent data structure on the GPU and I have to implement critical section to enable mutual exclusion. So, I wrote
the following code that uses atomicCAS() (alternatively, one can use atomicExch()).

shared (or device) block_work, lock;

device void criticalSectionCAS(int node){

volatile int lockstate=0;
while ((lockstate=atomicCAS(&lock, 0, 1)) == 0);

// atomicAdd(&block_work, 1);

lock = 0;

Both block_size and lock are initialized as follows (I am using 128 threads on a single block.)

global void test_sync_kernel(){

        int iam = blockIdx.x*blockDim.x+threadIdx.x;

        if (threadIdx.x == 0){
                  lock =0;



criticalSectionCAS(iam) …


I am printing the value of block_work. For 128 threads, it should be 128.

However, for the device-based shared variable, when I use block_work++, I get the value of block_work as 3.
When I used atomicAdd(&block_work, 1), I get 96. The values change if I make block_work as shared (e.g., I get a value of 4).


  1. Why do I need to use atomicAdd() to update a shared variable in the critical section?
  2. The wrong values seem to be related to warp size. How do the threads within a warp execute the atomicCAS()? I am using thread-local volatile variable. I also tried more complex patterns such as

node = blockIdx.x*blockDim.x+threadIdx.x;
volatile int compare= (node % 32), lockstate=0;
volatile int val = ((node+1) % 32);
while ((lockstate=atomicCAS(&lock, compare, val)) == compare);

where every thread in a warp uses an unique pair for compare and swap. This should result in serialization within the warp. But, this didnt help either.

I am doing something wrong. Can someone help me?


I dont know if this is related…

There is at least 1 compiler bug related to atomics on shared memory variables…

Just paste the following search keyword on google “Shared mem atomics sarnath cvnguyen sylvain site:
and use the first link.