Nsight Compute hangs on InstructionStats / WarpStateStats when profiling TMA + mbarrier kernels on Blackwell

Summary

Nsight Compute hangs at 0% on the first instrumented replay pass when collecting InstructionStats or WarpStateStats sections for a kernel that uses cp.async.bulk with mbarrier.try_wait.parity on B200 (SM 100). All other sections complete successfully.

Environment

Component Version
GPU NVIDIA B200 (SM 10.0 / compute_100)
NCU 2025.3.1.0 (build 36398880)
CUDA 13.0 (V13.0.88)
Driver 570.158.01

Sections affected

  • OK: SpeedOfLight, MemoryWorkloadAnalysis, ComputeWorkloadAnalysis, Occupancy, SchedulerStats, SourceCounters, PmSampling, WorkloadDistribution, LaunchStats
  • HANG: InstructionStats, WarpStateStats

Reproducer

A self-contained ~200-line repro is attached (repro.cu): ncu_tma_repro.zip (4.5 KB). It implements a double-buffered global-to-shared-memory pipeline using TMA:

  1. Thread 0 calls mbarrier_arrive_expect_tx with total expected bytes
  2. Thread 0 issues 4x cp.async.bulk into the alternate shared memory buffer
  3. All threads spin on mbarrier_try_wait_parity
  4. All threads read from the current buffer, accumulate results
  5. Swap buffers, repeat

The kernel uses cuda::ptx:: wrappers from CCCL/libcu++ (no hand-written inline PTX).

nvcc -arch=sm_100a -o repro repro.cu

./repro                                                    # PASS
ncu --section SpeedOfLight --launch-count 1 ./repro        # OK
timeout 60 ncu --section InstructionStats --launch-count 1 ./repro  # HANG
timeout 60 ncu --section WarpStateStats --launch-count 1 ./repro   # HANG

Has anyone encountered this issue? Are there known workarounds to collect warp state or instruction-level metrics on TMA + mbarrier kernels on Blackwell?

Hi, @aidan23

Thanks for reporting this to us. We can reproduce this and have submitted a internal ticket for tracking.
Any update, I will let you know.

Hi, @aidan23

Our dev helps to check this issue. And this turns out a sample issue.

From dev:

I found one serious hang candidate in repro.cu:90: the mbarrier is initialized with count `1`, and only thread 0 arrives for each phase, but all CTA threads wait on the phase parity.
That means thread 0 can observe phase completion, start the next `mbarrier_arrive_expect_tx` at repro.cu:127, and let that next phase complete before slower threads have exited the previous `mbarrier_try_wait_parity` at repro.cu:113 or repro.cu:152. Since parity only has one bit, a delayed thread can miss a phase transition and wait forever. The `__syncthreads()` at repro.cu:151 happens after thread 0 has already launched the next mbarrier phase, so it is too late to prevent this.
Minimal hardening:
```cpp
while (!ptx::mbarrier_try_wait_parity(
ptx::sem_acquire, ptx::scope_cta, mbar, parity, 0x989680u)) {}
parity ^= 1;
__syncthreads(); // prevent thread 0 from reusing the parity before all waiters observed it
```
Add that after the prologue wait and after the inner-loop wait. A more canonical fix is to initialize the mbarrier with `blockDim.x` and have every thread call `mbarrier_arrive_expect_tx`, with `tx_count = total_tx` only for thread 0 and `0` for the rest. That prevents the producer from lapping waiters without adding as much extra synchronization.
I did not see an obvious shared-memory bounds bug for the current host parameters: `dim_padded` makes both tiles full, the buffer offsets are disjoint, and the dynamic shared allocation covers `SMEM_BYTES + sizeof(uint64_t)`. The main correctness risk is the phase-lapping race above, which could plausibly show up only under NCU SASS patching because instrumentation changes timing in the wait loop.