Questions about using atomicCAS as a lock

Hello all,

My name is Wei-Fan, and I am new to CUDA programming.
I am practicing using atomicCAS to implement a lock on global memory,
but there are two cases where I just can't figure out what is going wrong.

case 1:

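    __global__ void adder(int *buff) {
        int idx = threadIdx.x;
        int lock = -2;   // sentinel value meaning "this element is locked"
        int old;
        do {
            old = buff[idx];
            if (old == lock) continue;
        } while (old != atomicCAS(&buff[idx], old, lock));  // try to take the lock
        buff[idx] = old + 1;   // store the new value, which also releases the lock
    }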

In this case, I launch a one-dimensional grid of 64 blocks, and each block is one-dimensional with 256 threads.
Every block adds 1 to each of the 256 elements, so starting from 0, each element of buff should ideally end up as 64. However, the result shows that there are some races, because not every element is 64.

case 2:

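    __global__ void adder(int *buff) {
        int idx = threadIdx.x;
        int lock = -2;   // sentinel value meaning "this element is locked"
        int old;
        do {
            do {
                old = buff[idx];
            } while (old == lock);   // spin while another thread holds the lock
        } while (old != atomicCAS(&buff[idx], old, lock));  // try to take the lock
        buff[idx] = old + 1;   // store the new value, which also releases the lock
    }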

Case 2 is basically just a modified version of case 1, but this one works even worse!
There seems to be a deadlock or something similar in this code: I waited a long time and the program never terminated. Why?

I know that I can implement “adder” with atomicAdd or atomicInc, or even with a plain CAS retry loop:

    do {
        old = buff[idx];
    } while (old != atomicCAS(&buff[idx], old, old + 1));

(I tried this one, and it works fine.)
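
For reference, here is a minimal sketch of that working variant together with the launch configuration described above (64 blocks of 256 threads). The kernel name adder_cas, the host helper run_adder, and the zero-initialized buffer holding one int per thread are assumptions made for illustration. The loop also feeds the value returned by atomicCAS back in as the next guess instead of re-reading buff[idx], which is a common way to write this retry pattern and a small change from the loop above.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Lock-free increment of buff[threadIdx.x] using a compare-and-swap retry loop.
// adder_cas is a hypothetical name chosen for this sketch.
__global__ void adder_cas(int *buff) {
    int idx = threadIdx.x;
    int old = buff[idx];   // initial guess of the current value
    int assumed;
    do {
        assumed = old;
        // Swap in assumed + 1 only if buff[idx] still equals assumed.
        // atomicCAS returns the value it actually found at that address.
        old = atomicCAS(&buff[idx], assumed, assumed + 1);
    } while (old != assumed);   // another thread got in first: retry
}

// Hypothetical host helper: runs the kernel on a zeroed buffer with one int
// per thread and returns the final contents. Error checking is omitted to
// keep the sketch short.
std::vector<int> run_adder(int blocks, int threads) {
    std::vector<int> host(threads, 0);
    size_t bytes = host.size() * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0, bytes);
    adder_cas<<<blocks, threads>>>(dev);
    // This copy runs on the default stream, so it waits for the kernel.
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Calling run_adder(64, 256) should return 256 elements that are all 64, because each of the 64 blocks contributes exactly one increment per index.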

I just wonder: what did I do wrong in case 1 and case 2?

Thanks.

Regards,

Wei-Fan