How can I use cp.reduce.async.bulk to reduce shared memory from multiple thread blocks into global memory?

The PTX documentation describes the instruction cp.reduce.async.bulk. I have a scenario where two thread blocks each hold a tile of data in shared memory (SMEM), and I need to reduce both tiles into a single tile in global memory. Specifically, elements at the same position in the two SMEM tiles should be combined into the corresponding element of the global tile. This is similar to split-K, where multiple blocks contribute partial results to one output block in global memory.
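To make the intended semantics concrete, here is a minimal sketch of the behavior I am after, written with plain atomicAdd instead of cp.reduce.async.bulk. The kernel name, the `TILE` size, and the placeholder values are made up for illustration; the real tiles would hold split-K partial results.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // hypothetical tile size: elements per block

// Each block fills its own SMEM tile, then adds it element-wise into the
// same tile in global memory. atomicAdd stands in for the reduction that
// I would like cp.reduce.async.bulk to perform.
__global__ void splitk_like_reduce(float* out) {
    __shared__ float tile[TILE];
    int i = threadIdx.x;
    tile[i] = static_cast<float>(blockIdx.x + 1);  // placeholder partial result
    __syncthreads();  // tile is complete before it is reduced
    atomicAdd(&out[i], tile[i]);
}

// Launches num_blocks blocks that all reduce into one global tile and
// checks the result on the host.
bool run_demo(int num_blocks) {
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, TILE * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d_out, 0, TILE * sizeof(float));

    splitk_like_reduce<<<num_blocks, TILE>>>(d_out);

    std::vector<float> h_out(TILE);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, TILE * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // Block b contributes the value b + 1, so every element should equal
    // 1 + 2 + ... + num_blocks.
    float expected = num_blocks * (num_blocks + 1) / 2.0f;
    for (float v : h_out) {
        if (v != expected) return false;
    }
    return true;
}

int main() {
    bool ok = run_demo(2);  // two blocks, as in my scenario
    std::printf("%s\n", ok ? "ok" : "mismatch");
    return ok ? 0 : 1;
}
```

With atomicAdd the number of contributing blocks is not constrained by the instruction itself, which is the property I would like to keep if the reduction is moved to cp.reduce.async.bulk.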

  1. Can this be achieved using cp.reduce.async.bulk?
  2. Is there an upper limit on the number of blocks that can participate in this reduction? For example, I am aware that some related instructions, such as those that operate on clusters, are limited by the cluster size. Does cp.reduce.async.bulk support reduction across an arbitrary number of blocks, or is there a limit on the block count?