The PTX documentation mentions the instruction `cp.reduce.async.bulk`. I have a scenario where I need to reduce data from two blocks in shared memory (SMEM) to global memory. Specifically, I want the corresponding positions of the two blocks to be reduced into one block, similar to split-K, where multiple blocks contribute to a single block in global memory.
- Can this be achieved using `cp.reduce.async.bulk`?
- Is there an upper limit on the number of blocks that can participate in this reduction? For example, I am aware that similar instructions, such as those involving clusters, have size limitations. Does `cp.reduce.async.bulk` support reduction across an arbitrary number of blocks, or is there a limit on the number of blocks?
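
To make the intended result concrete, here is a minimal CPU-side sketch of the semantics I am after, assuming the reduction is an element-wise sum. It is not PTX and does not use `cp.reduce.async.bulk`; the function name `reduce_tiles_into` and the use of `std::vector` as a stand-in for SMEM and global memory are hypothetical, for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference for the desired result: every contributing block
// holds a partial tile (stand-in for its SMEM buffer), and all tiles are
// summed element-wise into one destination tile (stand-in for global memory).
void reduce_tiles_into(std::vector<float>& dst,
                       const std::vector<std::vector<float>>& partial_tiles) {
    for (const auto& tile : partial_tiles) {
        // All tiles are assumed to have the same shape as the destination.
        assert(tile.size() == dst.size());
        for (std::size_t i = 0; i < dst.size(); ++i) {
            dst[i] += tile[i];  // position i of every tile accumulates into dst[i]
        }
    }
}
```

With two partial tiles this is the two-block case described above; the second question is whether the same pattern holds when the number of tiles grows.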