I have been working on similar problems for the last few months.
I have programmed a few “hacks” that work for atomic computations at the block level.
However, the streaming architecture of the GPU is not designed for such constructs, which can lead to deadlocks!
I have already posted a way of doing this; it is achieved with spin-loops + global writes:
http://forums.nvidia.com/index.php?showtopic=44144
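As a rough illustration of the spin-loop + global-write idea, here is a minimal sketch of a device-side "global barrier" (the names `g_arrived` and `global_barrier` are my own, not from the linked thread). Note the caveat from above: if the grid has more blocks than can be resident on the GPU at once, the spinning blocks starve the ones that never got scheduled, and this deadlocks.

```cuda
// Hypothetical block-level synchronization via a spin-loop on a
// global counter. Each block announces arrival with a global write
// (atomicAdd), then thread 0 spins until every block has arrived.
__device__ volatile int g_arrived = 0;

__device__ void global_barrier(int numBlocks)
{
    __syncthreads();                       // settle threads within the block
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);   // global write: "this block arrived"
        while (g_arrived < numBlocks)      // spin-loop until all blocks arrive
            ;
    }
    __syncthreads();                       // release the whole block together
}
```

Again, treat this as a workaround sketch, not a supported primitive: it only works when all blocks are simultaneously resident.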
1. Read from memory
2. Work in parallel
3. Reduce in parallel (for threads within a single MP)
4. Reduce serially using the modified programming constructs.
Reduction + Block level synchronization + Memory optimization = very high performance gains
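To make the four steps concrete, here is a sketch of the pattern I mean, in the style of the "last block finishes the job" trick: each block reduces its slice in shared memory (steps 1–3), publishes a partial sum with a global write, and the last block to arrive, detected via an atomic counter, reduces the partials serially (step 4). All names here (`g_done`, `reduce_sum`, `partial`) are illustrative, not from the thread above.

```cuda
__device__ unsigned int g_done = 0;  // counts blocks that have finished

__global__ void reduce_sum(const float *in, float *partial, float *out, int n)
{
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;              // step 1: read from memory
    __syncthreads();

    // steps 2-3: tree reduction among threads of a single MP
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];     // publish the block's result
    __threadfence();                               // make the write visible globally

    // step 4: the last block to finish reduces the partials serially
    __shared__ bool amLast;
    if (tid == 0)
        amLast = (atomicInc(&g_done, gridDim.x) == gridDim.x - 1);
    __syncthreads();

    if (amLast && tid == 0) {
        float sum = 0.0f;
        for (int b = 0; b < gridDim.x; ++b)
            sum += partial[b];
        *out = sum;
    }
}
```

The memory-optimization part of the equation is the shared-memory tree reduction; the block-level synchronization part is the `atomicInc` + `__threadfence` handoff, which avoids a second kernel launch.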
In short, use these constructs as tools for getting around the problem, not as a concrete reference!
I hope this helps.
Cheers,
Neeraj