'red' PTX instruction performance compared with 'atom'

The PTX ISA document doesn’t give much detail on the ‘red’ PTX instruction. From the architecture overview and the similarities to the ‘atom’ instruction (as well as a fair bit of hope), I gather that the reduction operation is actually carried out on the memory controller, using the same hardware as the ‘atom’ instruction, rather than in an SPU. If so, an SPU could issue a ‘red’ request to the memory controller and carry on with its own work without waiting for a result. That’s only a guess, though. Does anyone know:

  1. Is it atomic? More specifically, on both compute capability 1.1 and 1.3 hardware:

1a. If 32 threads, each from separate warps, issue a ‘red.global.inc.u32’ instruction on the same (initially zero) location in global memory, is the value at that location guaranteed to be 32?

1b. If all 32 threads in the same warp issue a ‘red.global.inc.u32’ instruction on the same (initially zero) location in global memory, is the value at that location guaranteed to be 32?

  2. In normal operation, is the cost of issuing 32 ‘red.global.add.u32’ requests from a warp, each to a different region of memory (no coalescing), on the order of ‘st.global.u32’, on the order of ‘atom.global.add.u32’, or somewhere in between?

  3. Is it possible to flood the memory controller request queue with these operations, such that the answer to question 2 changes?
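
For anyone who wants to check 1a and 1b empirically, here is a minimal sketch of the experiment in CUDA C. The kernel name `incSameLocation` and the helper `run` are made up for illustration. It uses `atomicInc` and discards the return value; whether the compiler turns that into ‘red.global.inc.u32’ or ‘atom.global.inc.u32’ is an assumption that should be checked in the generated PTX (for example with `nvcc -ptx`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: each participating thread increments *counter once
// and ignores the value returned by atomicInc.
__global__ void incSameLocation(unsigned int *counter, bool oneThreadPerWarp)
{
    // Case 1a: only lane 0 of each warp participates (32 warps, 32 increments).
    // Case 1b: every lane of a single warp participates (32 lanes, 32 increments).
    if (oneThreadPerWarp && (threadIdx.x % warpSize) != 0) return;

    // The second argument is the wrap-around bound; 0xFFFFFFFF means the counter
    // only wraps to zero after reaching the maximum 32-bit value.
    atomicInc(counter, 0xFFFFFFFFu);
}

// Runs one of the two cases and returns the final counter value.
static unsigned int run(bool oneThreadPerWarp)
{
    unsigned int *d = nullptr;
    unsigned int h = 0;
    cudaMalloc(&d, sizeof(h));
    cudaMemset(d, 0, sizeof(h));          // location is initially zero

    if (oneThreadPerWarp)
        incSameLocation<<<32, 32>>>(d, true);   // 32 blocks of one warp each
    else
        incSameLocation<<<1, 32>>>(d, false);   // one block containing one warp

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}

int main()
{
    printf("1a (separate warps): %u\n", run(true));
    printf("1b (same warp):      %u\n", run(false));
    return 0;
}
```

If the operation is atomic in both cases, both lines should report 32. A single run that prints 32 doesn’t prove a guarantee, of course; it only fails to disprove it.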

Thanks!
Steven

Edit: Including the word reduction in the message body to help with searches - ‘red’ is too small a word to be searched.

As I understand it, red is meant for cases where you want to perform an atomic operation but don’t care about the returned (intermediate) value. In theory, this would allow the compiler to change the order of operations, e.g. by using a reduction tree or even specialized hardware (“horizontal add/min/max” within a warp would be nice…)
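
To make the distinction concrete, here is a rough sketch using inline PTX. The wrapper names `addWithResult` and `addNoResult` are hypothetical. It assumes a toolchain that supports inline PTX (newer than CUDA 2.0), a 64-bit build (hence the "l" pointer constraint), and pointers that refer to global memory obtained from `cudaMalloc`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// atom: performs the add and returns the previous value in a register.
__device__ unsigned int addWithResult(unsigned int *p, unsigned int v)
{
    unsigned int old;
    asm volatile("atom.global.add.u32 %0, [%1], %2;"
                 : "=r"(old) : "l"(p), "r"(v) : "memory");
    return old;
}

// red: performs the same add but has no destination register.
__device__ void addNoResult(unsigned int *p, unsigned int v)
{
    asm volatile("red.global.add.u32 [%0], %1;"
                 : : "l"(p), "r"(v) : "memory");
}

__global__ void demo(unsigned int *sums, unsigned int *tickets)
{
    // The old value is needed here (it gives each thread a unique slot),
    // so this has to be atom.
    unsigned int slot = addWithResult(&sums[0], 1u);
    tickets[slot] = threadIdx.x;

    // Only the final total matters here, so red is enough.
    addNoResult(&sums[1], threadIdx.x);
}

int main()
{
    const int threads = 64;
    unsigned int *sums = nullptr, *tickets = nullptr;
    unsigned int h[2] = {0, 0};

    cudaMalloc(&sums, 2 * sizeof(unsigned int));
    cudaMalloc(&tickets, threads * sizeof(unsigned int));
    cudaMemset(sums, 0, 2 * sizeof(unsigned int));

    demo<<<1, threads>>>(sums, tickets);
    cudaMemcpy(h, sums, sizeof(h), cudaMemcpyDeviceToHost);

    printf("slots handed out: %u, sum of thread ids: %u\n", h[0], h[1]);

    cudaFree(tickets);
    cudaFree(sums);
    return 0;
}
```

The first call needs the returned value, so nothing short of atom will do; the second only cares about the final total, which is the case red is intended for.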

On CUDA 2.0 and GT200, red is implemented exactly the same way as atom (exact same instructions, as checked with Decuda).
I think/hope its introduction in CUDA 2.0 is a hint at the capabilities of some future GPU. ;)