Nsight Compute hangs at 0% on the first instrumented replay pass when collecting InstructionStats or WarpStateStats sections for a kernel that uses cp.async.bulk with mbarrier.try_wait.parity on B200 (SM 100). All other sections complete successfully.
A self-contained ~200-line repro is attached (repro.cu): ncu_tma_repro.zip (4.5 KB). It implements a double-buffered global-to-shared-memory pipeline using TMA:
Thread 0 calls mbarrier_arrive_expect_tx with total expected bytes
Thread 0 issues 4x cp.async.bulk into the alternate shared memory buffer
All threads spin on mbarrier_try_wait_parity
All threads read from the current buffer, accumulate results
Swap buffers, repeat
The kernel uses cuda::ptx:: wrappers from CCCL/libcu++ (no hand-written inline PTX).
Has anyone encountered this issue? Are there known workarounds to collect warp state or instruction-level metrics on TMA + mbarrier kernels on Blackwell?
Our dev helps to check this issue. And this turns out a sample issue.
From dev:
I found one serious hang candidate in repro.cu:90: the mbarrier is initialized with count `1`, and only thread 0 arrives for each phase, but all CTA threads wait on the phase parity.
That means thread 0 can observe phase completion, start the next `mbarrier_arrive_expect_tx` at repro.cu:127, and let that next phase complete before slower threads have exited the previous `mbarrier_try_wait_parity` at repro.cu:113 or repro.cu:152. Since parity only has one bit, a delayed thread can miss a phase transition and wait forever. The `__syncthreads()` at repro.cu:151 happens after thread 0 has already launched the next mbarrier phase, so it is too late to prevent this.
Minimal hardening:
```cpp
while (!ptx::mbarrier_try_wait_parity(
ptx::sem_acquire, ptx::scope_cta, mbar, parity, 0x989680u)) {}
parity ^= 1;
__syncthreads(); // prevent thread 0 from reusing the parity before all waiters observed it
``` Add that after the prologue wait and after the inner-loop wait. A more canonical fix is to initialize the mbarrier with `blockDim.x` and have every thread call `mbarrier_arrive_expect_tx`, with `tx_count = total_tx` only for thread 0 and `0` for the rest. That prevents the producer from lapping waiters without adding as much extra synchronization.
I did not see an obvious shared-memory bounds bug for the current host parameters: `dim_padded` makes both tiles full, the buffer offsets are disjoint, and the dynamic shared allocation covers `SMEM_BYTES + sizeof(uint64_t)`. The main correctness risk is the phase-lapping race above, which could plausibly show up only under NCU SASS patching because instrumentation changes timing in the wait loop.