Questions about TMA reduce


I dumped the SASS for ‘cp.reduce.async.bulk…’ and got ‘UBLKRED’. Then I noticed here that ‘UBLKRED’ is a copy with reduction. Say I have an array of numbers in GPU0’s shared memory, and I use this instruction to copy and reduce it into another array in GPU1’s HBM. Where does the reduction actually happen: in the TMA unit of GPU0, in the TMA unit of GPU1, or on some SMs?
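
To make the scenario concrete, here is a minimal sketch of the kind of kernel in question. It is an illustration under assumptions, not code from an actual project: the kernel name `bulk_reduce_add`, the `.add.u32` variant, and the 256-element size are all hypothetical, it assumes sm_90 or newer, and it assumes `dst` is a 16-byte-aligned global pointer (in the two-GPU case, a peer-mapped pointer to GPU1 memory, which is assumed here to be usable with this instruction).

```cuda
#include <cstdint>

// Hypothetical kernel: launch with exactly 256 threads per block.
// Stages 256 u32 values in shared memory, then issues one bulk
// reduce-add from shared memory into the global array `dst`.
__global__ void bulk_reduce_add(uint32_t* dst, const uint32_t* src) {
    __shared__ alignas(16) uint32_t smem[256];  // 1024 bytes, a multiple of 16

    smem[threadIdx.x] = src[threadIdx.x];

    // Make the ordinary shared-memory writes visible to the async proxy,
    // then wait until every thread in the block has done so.
    asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
    __syncthreads();

    if (threadIdx.x == 0) {
        uint32_t smem_addr =
            static_cast<uint32_t>(__cvta_generic_to_shared(smem));
        uint32_t bytes = static_cast<uint32_t>(sizeof(smem));

        // dst[i] += smem[i] for the whole range, issued as one bulk operation.
        asm volatile(
            "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.u32 "
            "[%0], [%1], %2;"
            :
            : "l"(dst), "r"(smem_addr), "r"(bytes)
            : "memory");

        // Commit the bulk group and wait until the source (shared memory)
        // has been read, so the buffer is safe to reuse or release.
        asm volatile("cp.async.bulk.commit_group;" ::: "memory");
        asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
    }
}
```

The kernel itself only sees a destination pointer, which is why it is unclear from the code alone where the add is performed when that pointer refers to another GPU's memory.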