I am trying to use atomicCAS() to sum elements stored in global memory inside a kernel (the vector is the result of a parallel reduction…), but it doesn't work!
The code is simple:
[codebox]__device__ int lock=0;
__device__ float square_norm=0;
square_norm += temp;[/codebox]
Can you see where the problem is?
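For reference, here is a minimal sketch of summing floats with atomicCAS() directly, without any lock. It assumes the goal is simply to accumulate a vector of partial results into one float in global memory. The names `atomic_add_via_cas`, `accumulate_kernel` and `sum_with_cas` are made up for this illustration and are not from the original post. The loop compares integer bit patterns, so it retries only when another thread changed the value in between.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: adds `val` to `*addr` using only atomicCAS, no lock.
__device__ float atomic_add_via_cas(float* addr, float val)
{
    int* addr_as_int = reinterpret_cast<int*>(addr);
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        // Swap in (assumed + val) only if *addr still holds `assumed`.
        old = atomicCAS(addr_as_int, assumed,
                        __float_as_int(__int_as_float(assumed) + val));
    } while (assumed != old);   // another thread got in first: retry
    return __int_as_float(old);
}

// Each thread adds one element of `partial` into `*result`.
__global__ void accumulate_kernel(const float* partial, int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomic_add_via_cas(result, partial[i]);
}

// Hypothetical host wrapper: writes the sum of h_partial[0..n) to *out.
// Returns false if a CUDA call fails.
bool sum_with_cas(const float* h_partial, int n, float* out)
{
    if (n <= 0) { *out = 0.0f; return true; }
    float* d_partial = nullptr;
    float* d_result = nullptr;
    if (cudaMalloc(&d_partial, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_result, sizeof(float)) != cudaSuccess) {
        cudaFree(d_partial);
        return false;
    }
    cudaMemcpy(d_partial, h_partial, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));      // all-zero bits == 0.0f

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    accumulate_kernel<<<blocks, threads>>>(d_partial, n, d_result);

    // The blocking copy also reports any error raised by the kernel.
    cudaError_t err = cudaMemcpy(out, d_result, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_partial);
    cudaFree(d_result);
    return err == cudaSuccess;
}
```

On GPUs of compute capability 2.0 or newer, the built-in atomicAdd() on float does the same job in one call, so the CAS loop is mainly useful where that is not available.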
This can deadlock due to warp divergence and the code’s reliance on undefined behavior.
All of this was discussed a long time ago. Try searching…
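To illustrate the divergence problem: if every thread of a warp spins in a `while(atomicCAS(...) != 0);` loop, the thread that won the lock may never get to run its critical section, because its warp-mates keep spinning. A commonly used workaround, sketched below, keeps acquire, critical section and release inside the same branch. This is only a sketch under assumptions: the kernel name `add_under_lock` and the host wrapper `sum_under_lock` are made up for this example, and on pre-Volta GPUs whether the pattern makes progress can still depend on how the compiler lays out the branch.

```cuda
#include <cuda_runtime.h>

__device__ int lock = 0;               // 0 = free, 1 = held
__device__ float square_norm = 0.0f;   // value protected by the lock

// Every thread adds temp[i] to square_norm while holding the lock.
__global__ void add_under_lock(const float* temp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    volatile float* sum = &square_norm;  // volatile: always read/write memory
    bool done = (i >= n);                // out-of-range threads skip the loop
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {   // lock acquired by this thread
            __threadfence();
            *sum = *sum + temp[i];           // critical section
            __threadfence();                 // publish the update first
            atomicExch(&lock, 0);            // then release atomically
            done = true;
        }
        // Threads that lost the race fall through and try again.
    }
}

// Hypothetical host wrapper: writes the sum of h_temp[0..n) to *out.
// Returns false if a CUDA call fails.
bool sum_under_lock(const float* h_temp, int n, float* out)
{
    if (n <= 0) { *out = 0.0f; return true; }
    float* d_temp = nullptr;
    if (cudaMalloc(&d_temp, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d_temp, h_temp, n * sizeof(float), cudaMemcpyHostToDevice);

    // Reset lock and accumulator so the wrapper can be called repeatedly.
    const int free_lock = 0;
    const float zero = 0.0f;
    cudaMemcpyToSymbol(lock, &free_lock, sizeof(int));
    cudaMemcpyToSymbol(square_norm, &zero, sizeof(float));

    const int threads = 64;
    add_under_lock<<<(n + threads - 1) / threads, threads>>>(d_temp, n);

    // The blocking copy also reports any error raised by the kernel.
    cudaError_t err = cudaMemcpyFromSymbol(out, square_norm, sizeof(float));
    cudaFree(d_temp);
    return err == cudaSuccess;
}
```

Serializing every thread through one lock is slow, so this is better seen as a demonstration of the locking pattern than as a recommended way to compute a sum.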
(1) Do not use “lock=0” when freeing the lock; release it atomically instead:
atomicCAS(&lock,1,0); // free lock
(2) If the above modification does not work, try allocating the lock outside the kernel.
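Both suggestions can be combined in a rough sketch like the one below. The lock is allocated on the host with cudaMalloc() and passed to the kernel, it is released with atomicCAS() instead of a plain store, and only thread 0 of each block takes it, so threads of the same warp never compete for the lock. The names `square_norm_kernel`, `square_norm_host` and `kThreads` are assumptions made for this example, as is the idea that each block first reduces its slice in shared memory.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 128;   // assumed block size, must be a power of two

// Each block reduces x[i]^2 over its slice, then thread 0 adds the block's
// partial sum to *result while holding the lock.
__global__ void square_norm_kernel(const float* x, int n,
                                   int* lock, volatile float* result)
{
    __shared__ float partial[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? x[i] * x[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        while (atomicCAS(lock, 0, 1) != 0) { }  // spin until acquired
        __threadfence();
        *result = *result + partial[0];         // critical section
        __threadfence();                        // publish before releasing
        atomicCAS(lock, 1, 0);                  // free lock
    }
}

// Hypothetical host wrapper: writes sum(h_x[i]^2) to *out.
// Returns false if a CUDA call fails.
bool square_norm_host(const float* h_x, int n, float* out)
{
    if (n <= 0) { *out = 0.0f; return true; }
    float* d_x = nullptr;
    float* d_result = nullptr;
    int* d_lock = nullptr;
    bool ok = cudaMalloc(&d_x, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_result, sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_lock, sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemset(d_result, 0, sizeof(float));  // 0.0f
        cudaMemset(d_lock, 0, sizeof(int));      // lock starts free

        const int blocks = (n + kThreads - 1) / kThreads;
        square_norm_kernel<<<blocks, kThreads>>>(d_x, n, d_lock, d_result);

        // The blocking copy also reports any error raised by the kernel.
        ok = cudaMemcpy(out, d_result, sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);        // cudaFree(nullptr) is a no-op
    cudaFree(d_result);
    cudaFree(d_lock);
    return ok;
}
```

With one contender per block the lock is taken only once per block, which also keeps the serialized part short.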
Whoa! I have found the answer (…maybe…) in a topic after 20 replies!
LSChien, thanks for your reply, but I guess your code doesn't work; check the above topic.
Sarnath, PLEASE :( can you post the final WORKING code of a spinlock in CUDA? Many thanks!
Unfortunately, I don't have that code now… I think ‘tmurray’ posted it in that topic… Check out…
It's difficult and tiresome… You may need to spend 1 or 2 days to get it working. Good luck!