The PTX ISA document doesn’t give much detail on the ‘red’ PTX instruction. From the architecture overview and the similarities to the ‘atom’ instruction (as well as a fair bit of hope), I gather that the reduction operation is carried out at the memory controller, using the same hardware as ‘atom’, rather than in an SPU. If so, an SPU could issue a ‘red’ request to the memory controller and carry on without waiting for a result. That’s only a guess, though. Does anyone know:
1. Is it atomic? More specifically, on both CL 1.1 and CL 1.3 hardware:
1a. If 32 threads, each from a separate warp, issue a ‘red.global.inc.u32’ instruction on the same (initially zero) location in global memory, is the value at that location guaranteed to be 32?
1b. If all 32 threads in the same warp issue a ‘red.global.inc.u32’ instruction on the same (initially zero) location in global memory, is the value at that location guaranteed to be 32?
2. In normal operation, is the cost of issuing 32 ‘red.global.add.u32’ write requests from a warp, each to a different region of memory (no coalescing), on the order of ‘st.global.u32’, on the order of ‘atom.global.add.u32’, or somewhere in between?
3. Is it possible to flood the memory controller request queue with these operations, such that the answer to question 2 changes?
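For concreteness, here is a rough CUDA sketch of the experiment behind 1a and 1b. It is only an illustration under assumptions: the names `bump` and `run_case` are made up for this post, and it uses `atomicAdd` with the result discarded, on the assumption that the toolchain lowers that to a reduction (‘red’ in PTX, or RED in the final machine code). That lowering is not guaranteed, so the generated PTX or machine code should be inspected before reading anything into the output. It also uses add rather than inc, purely to keep the sketch short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: every thread whose global index is a multiple
// of `stride` adds 1 to *counter. The return value of atomicAdd is
// deliberately discarded (assumption: this lets the compiler emit a
// reduction instead of a full atomic that returns the old value).
__global__ void bump(unsigned int *counter, unsigned int stride)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % stride == 0)
        atomicAdd(counter, 1u);
}

// Launches `stride` blocks of 32 threads, so exactly 32 threads take part.
//   stride == 32: one thread from each of 32 warps (case 1a)
//   stride == 1 : all 32 threads of a single warp  (case 1b)
// Returns the final counter value, or -1 if any CUDA call fails.
static long run_case(unsigned int stride)
{
    unsigned int *d_counter = NULL;
    unsigned int h_counter = 0;
    long result = -1;

    if (cudaMalloc((void **)&d_counter, sizeof(unsigned int)) != cudaSuccess)
        return -1;

    if (cudaMemset(d_counter, 0, sizeof(unsigned int)) == cudaSuccess) {
        bump<<<stride, 32>>>(d_counter, stride);
        if (cudaGetLastError() == cudaSuccess &&
            cudaDeviceSynchronize() == cudaSuccess &&
            cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            result = (long)h_counter;
    }

    cudaFree(d_counter);
    return result;
}

int main()
{
    printf("1a (one thread from each of 32 warps): %ld\n", run_case(32));
    printf("1b (all 32 threads of one warp):       %ld\n", run_case(1));
    return 0;
}
```

A run like this can only fail to find a counterexample; it cannot show that 32 is guaranteed, which is why I’m asking about the documented behaviour.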
Edit: Including the word reduction in the message body to help with searches - ‘red’ is too small a word to be searched.