For the cp.async.bulk instruction (from shared memory to global memory), we know it accumulates values from the shared memory (SMEM) of some CTAs into global memory. Is there any limit on the number of CTAs involved?