atomicCAS() doesn't work!

I am trying to use atomicCAS() to sum elements stored in global memory inside a kernel (the vector is the result of a parallel reduction…), but it doesn't work!

The code is simple:

[codebox]__device__ int lock = 0;
__device__ float square_norm = 0;

__global__ void mykernel(…){
    if (tid == 0) {
        do {} while (atomicCAS(&lock, 0, 1)); // set lock
        square_norm += temp[0];
        __threadfence();                      // wait for write completion
        lock = 0;                             // free lock
    }
}
[/codebox]

Can you see where the problem is?

Thanks

This can deadlock because of warp divergence, and because the code relies on undefined behavior.
All of this was discussed a long, long time ago. Try searching…
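A minimal sketch of the divergence-friendly structure that is commonly suggested when every thread of a warp contends for the same lock, assuming 32-bit global atomics (compute capability 1.1+). The names `locked_increment` and `count_with_lock` are hypothetical. Instead of spinning in a separate `do{}while` loop, the critical section sits inside the branch that won `atomicCAS`, so the winner releases the lock before the warp reconverges and the losers retry on the next iteration. The lock is released with `atomicExch` rather than a plain store, and the shared value is accessed through a `volatile` pointer so it is not kept in a register or served from a non-coherent cache.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread adds 1.0f to *sum under one global spinlock.
__global__ void locked_increment(int *lock, volatile float *sum)
{
    bool done = false;
    while (!done) {
        // atomicCAS returns the old value: 0 means this thread took the lock.
        if (atomicCAS(lock, 0, 1) == 0) {
            __threadfence();          // order the acquire before the read below
            *sum = *sum + 1.0f;       // critical section
            __threadfence();          // make the write visible before releasing
            atomicExch(lock, 0);      // release the lock atomically
            done = true;
        }
        // Losers fall through and retry; the winner has already released the
        // lock by the time the warp reconverges at the end of the if.
    }
}

// Hypothetical host helper: launches blocks x threads increments, returns the sum.
float count_with_lock(int blocks, int threads)
{
    int *d_lock = nullptr;
    float *d_sum = nullptr;
    cudaMalloc(&d_lock, sizeof(int));
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemset(d_lock, 0, sizeof(int));    // lock starts free
    cudaMemset(d_sum, 0, sizeof(float));   // all-zero bytes == 0.0f

    locked_increment<<<blocks, threads>>>(d_lock, d_sum);

    float h_sum = 0.0f;
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_lock);
    cudaFree(d_sum);
    return h_sum;
}
```

Serializing every thread through one lock is slow; the sketch only illustrates the control-flow shape that avoids spinning while the lock holder is parked on the other side of a divergent branch.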

(1)

Don't use “lock=0” to free the lock.

Try

atomicCAS(&lock, 1, 0); // free lock

(2) If the above modification does not work, try allocating the lock outside the kernel.
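The two suggestions above can be combined into a sketch like the following, assuming 32-bit global atomics (compute capability 1.1+). The names `accumulate_partials` and `sum_partials`, and the `partials` array holding one reduction result per block, are hypothetical. The lock is an `int` allocated with `cudaMalloc` and zeroed with `cudaMemset` on the host, only thread 0 of each block contends for it (so no two lanes of the same warp spin against each other), and it is freed with `atomicCAS(lock, 1, 0)` rather than `lock = 0`.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 of each block adds that block's partial result
// into *total while holding a spinlock that lives in host-allocated memory.
__global__ void accumulate_partials(const float *partials, int *lock,
                                    volatile float *total)
{
    if (threadIdx.x == 0) {                   // one contender per block
        while (atomicCAS(lock, 0, 1) != 0) {
            // spin until the lock is free
        }
        __threadfence();                      // order the acquire before the read
        *total = *total + partials[blockIdx.x];
        __threadfence();                      // publish the update before unlocking
        atomicCAS(lock, 1, 0);                // free the lock atomically
    }
}

// Hypothetical host helper: sums num_blocks partial results on the GPU.
float sum_partials(const float *h_partials, int num_blocks)
{
    float *d_partials = nullptr;
    float *d_total = nullptr;
    int *d_lock = nullptr;
    cudaMalloc(&d_partials, num_blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMalloc(&d_lock, sizeof(int));         // lock allocated outside the kernel
    cudaMemcpy(d_partials, h_partials, num_blocks * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));
    cudaMemset(d_lock, 0, sizeof(int));       // lock starts free

    accumulate_partials<<<num_blocks, 128>>>(d_partials, d_lock, d_total);

    float h_total = 0.0f;
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_partials);
    cudaFree(d_total);
    cudaFree(d_lock);
    return h_total;
}
```

The block size of 128 is arbitrary; only thread 0 of each block touches the lock. Because a block that is waiting on the lock is already resident and the holder is by definition running, the spin cannot wait on a block that never gets scheduled. `atomicExch(lock, 0)` would release the lock equally well, since the holder knows its value is 1.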

Whoa! I have found the answer (…maybe…) in a topic, after 20 replies!

http://forums.nvidia.com/index.php?showtopic=98444

LSChien, thanks for your reply, but I think your code doesn't work; check the topic above.

Sarnath, PLEASE :( can you post the final WORKING code of a spinlock in CUDA? Many thanks!

GiulioPU,
Unfortunately, I don't have that code now… I think ‘tmurray’ posted it in that topic… Check it out…
It's difficult and tiresome… You may need to spend 1 or 2 days to get it working. Good luck!