I am trying to write a concurrent data structure on the GPU, and I have to implement a critical section to enforce mutual exclusion. So I wrote
the following code, which uses atomicCAS() (alternatively, one can use atomicExch()).
__device__ int block_work, lock;   // alternatively declared __shared__

__device__ void criticalSectionCAS(int node){
    volatile int lockstate = 0;
    while ((lockstate = atomicCAS(&lock, 0, 1)) == 0);
    block_work++;
    // atomicAdd(&block_work, 1);
    lock = 0;
}
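For reference, the atomicExch() alternative mentioned above would be a sketch like this (the function name criticalSectionExch is mine; it uses the same global lock and block_work variables):

```cuda
__device__ void criticalSectionExch(int node){
    // spin until the exchange returns 0, i.e. we observed the lock unheld
    while (atomicExch(&lock, 1) != 0);
    block_work++;
    lock = 0;   // release (no memory fence here, same as the CAS version)
}
```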
Both block_work and lock are initialized as follows (I am using 128 threads in a single block):
__global__ void test_sync_kernel(){
    int iam = blockIdx.x*blockDim.x + threadIdx.x;
    if (threadIdx.x == 0){
        block_work = 0;
        lock = 0;
    }
    __syncthreads();
    criticalSectionCAS(iam);
    ...
}
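A minimal host-side harness to launch the kernel and print block_work could look like this (a sketch, not my exact driver code; it assumes block_work is declared __device__ and is read back with cudaMemcpyFromSymbol):

```cuda
#include <cstdio>

int main(){
    test_sync_kernel<<<1, 128>>>();   // one block of 128 threads
    cudaDeviceSynchronize();

    int result = 0;
    cudaMemcpyFromSymbol(&result, block_work, sizeof(int));
    printf("block_work = %d\n", result);   // expected 128
    return 0;
}
```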
I am printing the value of block_work; for 128 threads, it should be 128.
However, with a __device__ variable and plain block_work++, I get a value of 3.
With atomicAdd(&block_work, 1), I get 96. The values change if I declare block_work as __shared__ (e.g., I then get 4).
Questions:
- Why do I need to use atomicAdd() to update a shared variable in the critical section?
- The wrong values seem to be related to the warp size. How do the threads within a warp execute atomicCAS()? I am already using a thread-local volatile variable. I also tried more complex patterns, such as
    node = blockIdx.x*blockDim.x + threadIdx.x;
    volatile int compare = (node % 32), lockstate = 0;
    volatile int val = ((node + 1) % 32);
    while ((lockstate = atomicCAS(&lock, compare, val)) == compare);
where every thread in a warp uses a unique pair of compare and swap values. This should result in serialization within the warp, but it didn't help either.
I must be doing something wrong. Can someone help me?
Thanks!
Rajesh