In CUTLASS, there is a `tma_store_wait` function, which corresponds to `cp.async.bulk.wait_group.read`. Based on my observations while working with TMA, waiting does not seem to be necessary after issuing a TMA store. It appears to behave like `expect_tx`, where the operation seems to complete automatically.
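To make the question concrete, here is a minimal sketch of the sequence I mean. It is not CUTLASS code: it uses raw PTX for a 1D bulk store (no tensor map), and the kernel name, buffer size, and the `run_bulk_store_check` helper are placeholders I made up for illustration. My understanding, which may be incomplete, is that the `.read` wait only guarantees the shared-memory source has been read, so it matters mainly when the buffer is reused or the block exits right after the store. It assumes an sm_90 GPU and compilation with something like `nvcc -arch=sm_90a`.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

constexpr int kElems = 256;  // 256 floats = 1024 bytes (bulk copies need a multiple of 16 bytes)

// Hypothetical kernel: stage data in shared memory, then bulk-store it to global memory.
__global__ void bulk_store_kernel(float* gmem_out) {
  __shared__ __align__(16) float smem[kElems];

  // Every thread fills part of the staging buffer through the generic proxy.
  for (int i = threadIdx.x; i < kElems; i += blockDim.x) {
    smem[i] = static_cast<float>(i);
  }

  // Make the shared-memory writes visible to the async proxy, then sync the block.
  asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
  __syncthreads();

  if (threadIdx.x == 0) {
    uint32_t smem_addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem));
    uint32_t bytes = kElems * sizeof(float);

    // Issue the asynchronous shared -> global bulk copy.
    asm volatile("cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
                 :: "l"(gmem_out), "r"(smem_addr), "r"(bytes) : "memory");

    // Close the bulk async-group that contains the copy above.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");

    // The wait in question: block until the copy has finished READING smem.
    asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
  }

  // After this barrier any thread may overwrite smem again.
  __syncthreads();
}

// Hypothetical host-side check: launch one block and compare against the expected pattern.
bool run_bulk_store_check() {
  float* d_out = nullptr;
  if (cudaMalloc(&d_out, kElems * sizeof(float)) != cudaSuccess) return false;

  bulk_store_kernel<<<1, 128>>>(d_out);

  std::vector<float> h_out(kElems, -1.0f);
  cudaError_t err = cudaMemcpy(h_out.data(), d_out, kElems * sizeof(float),
                               cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  if (err != cudaSuccess) return false;

  for (int i = 0; i < kElems; ++i) {
    if (h_out[i] != static_cast<float>(i)) return false;
  }
  return true;
}
```

Calling `run_bulk_store_check()` from a `main` is enough to try it. Removing the `wait_group.read` line may well still give correct results in a kernel this small, which matches what I observed, but I am not sure that holds once the shared-memory buffer is rewritten immediately after the store.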