I am trying to use atomicCAS() to sum elements stored in global memory inside a kernel (the vector is the result of a parallel reduction…), but it doesn't work!
This can deadlock because of warp divergence and the code’s reliance on undefined behavior.
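For what it's worth, if the code was using atomicCAS() as a lock, the usual alternative is a lock-free retry loop, so that no thread ever has to wait for another thread in its own warp. Below is a minimal sketch, assuming the elements are float (the post doesn't say) and that the partial results already sit in global memory. The names atomicAddViaCAS, sumPartials and sumOnDevice are made up for illustration and are not taken from the posted code.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: float add built on atomicCAS, with no lock involved.
// Each thread retries until its own compare-and-swap succeeds.
__device__ float atomicAddViaCAS(float* address, float val)
{
    int* addressAsInt = reinterpret_cast<int*>(address);
    int old = *addressAsInt;
    int assumed;
    do {
        assumed = old;
        // Write (assumed + val) only if the stored bits still equal 'assumed'.
        old = atomicCAS(addressAsInt, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
        // Compare as integers so a NaN value cannot make the loop spin forever.
    } while (assumed != old);
    return __int_as_float(old);
}

// Hypothetical kernel: every thread adds one partial result to *total.
__global__ void sumPartials(const float* partials, int n, float* total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAddViaCAS(total, partials[i]);
    }
}

// Hypothetical host wrapper (error checking omitted to keep the sketch short).
float sumOnDevice(const std::vector<float>& host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;

    float* dPartials = nullptr;
    float* dTotal = nullptr;
    cudaMalloc(&dPartials, n * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dPartials, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));  // all-zero bits represent 0.0f

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    sumPartials<<<blocks, threads>>>(dPartials, n, dTotal);

    float result = 0.0f;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dPartials);
    cudaFree(dTotal);
    return result;
}
```

On devices of compute capability 2.0 or newer, the built-in atomicAdd() on float should do the same job and is the simpler choice; a CAS loop like this is mainly worth it for types or operations that have no native atomic. Note that the order of the additions is not fixed, so the float result can differ slightly from run to run.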
All of this was discussed a long time ago. Try searching…
GiulioPU,
Unfortunately, I don't have that code now… I think ‘tmurray’ posted it in that topic… Check out…
It's difficult and tiresome… You may need to spend 1 or 2 days to get it working. Good luck!