atomicCAS() doesn't work!

I am trying to use atomicCAS() to sum elements stored in global memory inside a kernel (the vector is the result of a parallel reduction…), but it doesn't work!

The code is simple:

[codebox]__device__ int lock = 0;

__device__ float square_norm = 0;

__global__ void mykernel(…){

    …

    square_norm += temp[0];

    …

}[/codebox]

Can you see where the problem is?


This can deadlock due to warp divergence and the code’s reliance on undefined behavior.
All of this was discussed a long time ago. Try searching…
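
For anyone reading along, here is a rough sketch of one way around the divergence problem. It assumes the reduction leaves exactly one partial result per block, so only thread 0 of each block needs the lock and no two threads of the same warp ever spin against each other. The names `add_partial`, `partial` and `run_add_partial` are made up for illustration; this is not the code from the older topic.

```cuda
#include <cuda_runtime.h>
#include <vector>

__device__ int lock = 0;              // 0 = free, 1 = held
__device__ float square_norm = 0.0f;

// Hypothetical kernel: partial[b] holds the reduction result of block b.
__global__ void add_partial(const float *partial)
{
    // Assumption: one contender per block, so the spin loop never has
    // lanes of the same warp waiting on each other.
    if (threadIdx.x == 0) {
        while (atomicCAS(&lock, 0, 1) != 0) { /* spin until we swap 0 -> 1 */ }
        __threadfence();              // do not read the sum before we own the lock

        // volatile forces real loads/stores instead of cached or register copies
        volatile float *norm = &square_norm;
        *norm = *norm + partial[blockIdx.x];

        __threadfence();              // make the update visible before releasing
        atomicExch(&lock, 0);         // release with an atomic, not a plain store
    }
}

// Hypothetical host helper: every block contributes 1.0f.
float run_add_partial(int blocks)
{
    std::vector<float> ones(blocks, 1.0f);
    float *d_partial = nullptr;
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_partial, ones.data(), blocks * sizeof(float),
               cudaMemcpyHostToDevice);

    // Reset the device globals so the helper can be called repeatedly.
    int zero_lock = 0;
    float zero_sum = 0.0f;
    cudaMemcpyToSymbol(lock, &zero_lock, sizeof(int));
    cudaMemcpyToSymbol(square_norm, &zero_sum, sizeof(float));

    add_partial<<<blocks, 32>>>(d_partial);
    cudaDeviceSynchronize();

    float result = 0.0f;
    cudaMemcpyFromSymbol(&result, square_norm, sizeof(float));
    cudaFree(d_partial);
    return result;
}
```

Calling `run_add_partial(64)` launches 64 blocks that each add 1.0f under the lock. If several threads of one warp had to take the lock, this simple spin loop could hang again on hardware that runs a warp in lockstep.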


(1) Do not use “lock=0” to free the lock. Use an atomic operation instead:

atomicCAS(&lock, 1, 0); // free lock

(2) If the above modification does not work, try allocating the lock outside the kernel.
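
To make (2) concrete, here is a minimal sketch of a lock that lives in memory allocated by the host with cudaMalloc() and is passed to the kernel as a pointer. The names `locked_add` and `run_locked_add` are hypothetical, and the sketch again assumes that only thread 0 of each block takes the lock, which is an assumption about the reduction and not something stated in the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: lock and sum both live in memory allocated by the host.
__global__ void locked_add(int *lock, float *sum, const float *partial)
{
    if (threadIdx.x != 0) return;           // assumption: one contender per block

    while (atomicCAS(lock, 0, 1) != 0) { }  // acquire: swap 0 -> 1
    __threadfence();

    volatile float *s = sum;                // force real global loads/stores
    *s = *s + partial[blockIdx.x];

    __threadfence();                        // publish the update first
    atomicCAS(lock, 1, 0);                  // free lock atomically, as in (1)
}

// Hypothetical host helper: every block contributes 1.0f.
float run_locked_add(int blocks)
{
    std::vector<float> ones(blocks, 1.0f);
    int *d_lock = nullptr;
    float *d_sum = nullptr;
    float *d_partial = nullptr;

    cudaMalloc(&d_lock, sizeof(int));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));

    cudaMemset(d_lock, 0, sizeof(int));     // lock starts free
    cudaMemset(d_sum, 0, sizeof(float));    // all-zero bits are 0.0f
    cudaMemcpy(d_partial, ones.data(), blocks * sizeof(float),
               cudaMemcpyHostToDevice);

    locked_add<<<blocks, 32>>>(d_lock, d_sum, d_partial);

    float result = 0.0f;
    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&result, d_sum, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_lock);
    cudaFree(d_sum);
    cudaFree(d_partial);
    return result;
}
```

Passing the lock as a pointer also makes it easy to reset it with cudaMemset() between launches, or to use a separate lock for each quantity being protected.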

Whoa! I found the answer (…maybe…) in a topic, after 20 replies!

LSChien, thanks for your reply, but I don't think your code works; check the topic above.

Sarnath, PLEASE :( can you post the final WORKING code of a spinlock in CUDA? Many thanks!

Unfortunately, I don't have that code now… I think ‘tmurray’ posted it in that topic… Check it out…
It's difficult and tiresome… You may need to spend 1 or 2 days to get it working. Good luck!
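
A side note for later readers: if the only goal is to sum floats in global memory, as in the first post, a spinlock may not be needed at all. The sketch below uses the well-known compare-and-swap retry loop to build a float add directly on atomicCAS(). The names `atomic_add_float`, `sum_ones` and `run_sum_ones` are made up for this example, and it is not the code from the other topic.

```cuda
#include <cuda_runtime.h>

// Lock-free float add built on the 32-bit integer atomicCAS().
// Returns the value stored at addr before the add.
__device__ float atomic_add_float(float *addr, float val)
{
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        // Try to replace the bits we last saw with the bits of (seen + val).
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
        // If another thread changed the value in between, old != assumed: retry.
    } while (assumed != old);
    return __int_as_float(old);
}

// Hypothetical kernel: every thread adds 1.0f to the same location.
__global__ void sum_ones(float *sum)
{
    atomic_add_float(sum, 1.0f);
}

// Hypothetical host helper returning the accumulated sum.
float run_sum_ones(int blocks, int threads)
{
    float *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_sum, 0, sizeof(float));    // all-zero bits are 0.0f

    sum_ones<<<blocks, threads>>>(d_sum);

    float result = 0.0f;
    // This blocking copy waits for the kernel to finish before reading.
    cudaMemcpy(&result, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return result;
}
```

No thread ever holds a lock here, so warp divergence cannot deadlock it; a thread only retries when another thread's update got in first. On devices of compute capability 2.0 and later, the built-in atomicAdd() on float does the same job directly.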