Implementing H100 TMA multicast with cuda::ptx:: functions, but it's slower than 8 independent TMA operations fetching the same tile in a cluster

First, I initialized a mapped distributed shared memory barrier:

// initialization of distributed shared memory barrier (mapped from one CTA)
if (threadIdx.x == 0 && blockRankInCluster == 0) {
    cuda::ptx::mbarrier_init(bar, 2); // tried 1 instead of 2 too
    cuda::ptx::fence_mbarrier_init(cuda::ptx::sem_release, cuda::ptx::scope_cluster);
    cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release, cuda::ptx::scope_cluster, cuda::ptx::space_cluster, reinterpret_cast<uint64_t*>(bar), 1); // tried without this too
}
cluster.sync();

Then I initiated the TMA copy using the tensor map:

if (threadIdx.x == 0 && blockRankInCluster == 0) {
    const uint16_t ctaMask = 0b1; // tried 0b11 
    cuda::ptx::cp_async_bulk_tensor(cuda::ptx::space_cluster, cuda::ptx::space_global, reinterpret_cast<void*>(s_mem), tensorMap, coords, reinterpret_cast<uint64_t*>(bar), ctaMask);                
}

After other asynchronous work, I wait like this:

// cuda::ptx:: functions don't return any state. How can I know which state to start with? Is the state a token like the one non-cluster TMA operations use?
if (threadIdx.x == 0 && blockRankInCluster == 0) {
    while (!cuda::ptx::mbarrier_try_wait(cuda::ptx::sem_acquire, cuda::ptx::scope_cluster, reinterpret_cast<_CUDA_VSTD::uint64_t*>(bar), state))     {
    __nanosleep(10);
    };
}

but it never gets out of this while-loop.

I used ctaMask = 0b1 in a two-block cluster launch (declared statically with __cluster_dims__(2, 1, 1)).

The state is incremented after the wait, but that line is never reached because of the infinite loop.

What can cause a try-wait to never return true?


If this works, I will convert it to a multiple-CTA wait version to make synchronization more efficient. For now I'm only trying to get it running correctly, and I have no idea how TMA knows the destinations for the other CTAs' smem. Can it infer their destinations from just the calling CTA's destination parameter?


The tensor map encoding is correct, the tile data in gmem is aligned, and everything works for non-cluster TMA operations. The distributed shared memory barrier and the data pointers are aligned to 32 and 128 bytes respectively. No error is returned; it just keeps spinning inside the while loop.

Also, if someone could discuss the details of how the mbarrier works and how a cluster initiates TMA from single or multiple CTAs for all CTAs, I'd appreciate it. For example, more CTAs in the cluster would require at least 1 arrival per CTA, right? Then maybe a try-wait per CTA instead of cluster.sync(). I guess the expected arrive count is (1 from TMA) + (1 for each thread calling try-wait).
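To make my question concrete, the accounting I have in mind for the per-CTA version looks like this (untested sketch; I'm not sure the scopes/spaces are right, and `size` is the byte count as in my snippets below):

```cpp
// Untested sketch: one arriving thread per CTA, so the arrival count is 1.
// The TMA bytes would be accounted through the transaction count (expect_tx),
// not through an extra arrival, if I understand the model correctly.
if (threadIdx.x == 0) {                           // in every CTA
    cuda::ptx::mbarrier_init(bar, 1);             // 1 = threads that will arrive
    cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release,
                                         cuda::ptx::scope_cta,
                                         cuda::ptx::space_shared,
                                         reinterpret_cast<uint64_t*>(bar),
                                         size);   // expected bytes, not a count
}
```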

I'm trying to do this because generating a single TMA request from one CTA and then distributing from that CTA to the other CTAs in the cluster (through a manual distributed-smem copy) has very limited bandwidth, only about 2-3 TB/s (H100). Compared to the TMA bandwidth, that's too low. So a multicast could make it 5-10 times faster (or up to the peak write bandwidth of the destination shared memory).

I completed the multicast like the following, but now it is slower than a normal (non-multicast) TMA operation:

init:

if (threadIdx.x == 0 && blockRankInCluster == 0) {
    cuda::ptx::mbarrier_init(bar, 1);
    cuda::ptx::fence_mbarrier_init(cuda::ptx::sem_release, cuda::ptx::scope_cluster);
}
cluster.sync();

copy:

if (threadIdx.x == 0 && blockRankInCluster == 0) {
    const int32_t coords[2] = { x, y };
    const uint16_t ctaMask = 0b11111111;
    cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release, cuda::ptx::scope_cluster, cuda::ptx::space_cluster, reinterpret_cast<uint64_t*>(bar), size);
    cuda::ptx::cp_async_bulk_tensor(cuda::ptx::space_cluster, cuda::ptx::space_global, reinterpret_cast<void*>(s_mem), tensorMap, coords, reinterpret_cast<uint64_t*>(bar), ctaMask);
}

wait:

if (threadIdx.x == 0 && blockRankInCluster == 0) {
    while (!cuda::ptx::mbarrier_try_wait(cuda::ptx::sem_acquire, cuda::ptx::scope_cluster, reinterpret_cast<_CUDA_VSTD::uint64_t*>(bar), state)) {
        __nanosleep(10);
    };
}

For example, a non-multicast setup where each CTA in the same kernel loads the exact same global tile with its own TMA reaches 8 TB/s, while this multicast version reaches only about 2 TB/s, which is even lower than a normal TMA (single-CTA load) plus a manual broadcast to the other CTAs through distributed smem.

Even without cluster.sync(), it is slower than the non-multicast version.

Why is loading the same tile 8 times not slower than loading it only once (multicast)?

I checked the incoming data; it's correct for all CTAs of the cluster. It's just slower than independent TMA operations per CTA.

On H100, TMA multicast isn’t automatically a performance win over independent TMA.
Multicast requires cluster-level coordination and mbarrier costs, which can outweigh the benefit if the transfer size or reuse pattern doesn’t amortize that overhead.

Thank you, Linzheng.

I was using single-source TMA (1 CTA to the other CTAs in the cluster for the whole tile of data). I guess a single CTA can't drive the other blocks' TMA hardware to do this collectively, and it probably loses the benefit of the interconnect bandwidth within the cluster.

I mean, even if the single TMA unit isn't the bottleneck, maybe (not using all of) the cluster interconnect is. I had assumed NVIDIA unified that part too, similar to how warp-shuffle and smem throughput are unified.

That’s a reasonable intuition. In practice, a single “producer CTA” can still become the bottleneck — multicast doesn’t increase the issue rate, and you still pay a single TMA issue plus the cluster distribution/sync cost.
When the per-tile transfer is small (or coordination dominates), it’s not unusual to see higher aggregate throughput from multiple independent TMA calls.


I just want to point out what seems to me to be a bug/misunderstanding in the code. You wrap the mbarrier operations in if (threadIdx.x == 0 && blockRankInCluster == 0), but when using multicast, each CTA in the cluster should use its own local mbarrier (see reference to PTX documentation below), since both data and mbarrier signal are multicast. So the mbarrier_init and mbarrier_arrive_expect_tx calls should happen in all CTAs, only cp_async_bulk_tensor with multicast should be limited to blockRankInCluster == 0.
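In code, the corrected split looks roughly like this (untested sketch, reusing the names from the snippets above):

```cpp
// Untested sketch: every CTA initializes and arms its OWN local mbarrier,
// because the multicast copy signals the same CTA-relative mbarrier offset
// in each destination CTA. Only rank 0 issues the copy itself.
if (threadIdx.x == 0) {                        // in EVERY CTA of the cluster
    cuda::ptx::mbarrier_init(bar, 1);
    cuda::ptx::fence_mbarrier_init(cuda::ptx::sem_release,
                                   cuda::ptx::scope_cluster);
}
cluster.sync();

if (threadIdx.x == 0) {                        // in EVERY CTA: expect `size` bytes locally
    cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release,
                                         cuda::ptx::scope_cta,
                                         cuda::ptx::space_shared,
                                         reinterpret_cast<uint64_t*>(bar), size);
}

if (threadIdx.x == 0 && blockRankInCluster == 0) {   // ONLY rank 0 multicasts
    cuda::ptx::cp_async_bulk_tensor(cuda::ptx::space_cluster, cuda::ptx::space_global,
                                    reinterpret_cast<void*>(s_mem), tensorMap, coords,
                                    reinterpret_cast<uint64_t*>(bar), ctaMask);
}

if (threadIdx.x == 0) {                        // each CTA waits on its OWN barrier
    while (!cuda::ptx::mbarrier_try_wait(cuda::ptx::sem_acquire, cuda::ptx::scope_cta,
                                         reinterpret_cast<uint64_t*>(bar), state)) {
        __nanosleep(10);
    }
}
```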

The optional modifier .multicast::cluster allows copying of data from global memory to shared memory of multiple CTAs in the cluster. Operand ctaMask specifies the destination CTAs in the cluster such that each bit position in the 16-bit ctaMask operand corresponds to the %ctaid of the destination CTA. The source data is multicast to the same CTA-relative offset as dstMem in the shared memory of each destination CTA. The mbarrier signal is also multicast to the same CTA-relative offset as mbar in the shared memory of the destination CTA.


Yes. Later I converted cuda::device::barrier_arrive_tx (and barrier.wait with token) to be called from each CTA. Using a single CTA to wait and then a cluster-wide sync to get the other CTAs in sync was a bad idea. But it didn't change performance much. It still requires interleaved multicast to use the full bandwidth of the interconnect, or to reduce the overall latency of the cluster multicast.
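By "interleaved multicast" I mean something like this (untested sketch; `subTileH` and `subTileBytes` are made-up names for the per-CTA slice of the tile):

```cpp
// Untested sketch: every CTA issues a multicast for a DIFFERENT sub-tile
// with the full mask, so all TMA units in the cluster stay busy while every
// sub-tile still lands in all CTAs' shared memory.
if (threadIdx.x == 0) {                      // in every CTA, rank r
    const uint16_t fullMask = 0b11111111;    // all 8 CTAs
    const int32_t subCoords[2] = { x, y + (int32_t)blockRankInCluster * subTileH };
    cuda::ptx::cp_async_bulk_tensor(cuda::ptx::space_cluster, cuda::ptx::space_global,
                                    reinterpret_cast<void*>(s_mem + blockRankInCluster * subTileBytes),
                                    tensorMap, subCoords,
                                    reinterpret_cast<uint64_t*>(bar), fullMask);
}
```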

It can be tricky to benchmark this correctly as well. Cluster multicast should help you get higher overall bandwidth if you are getting close to the L2/HBM bandwidth limit in non-multicast situations. I can confirm that, for large transfers, pipelined and measured over a significant amount of time, non-multicast cp_async_bulk tops out at ~38B/cycle/SM on 80GB/132SM variants of H100, i.e. just under 5120B per cycle GPU-wide from L2. (Note that the 94GB cards have more L2 cache and bandwidth). Whereas with 4x cluster multicast you can get ~70B/cycle/SM. So it’s no longer maxing out L2 bandwidth and is limited by intra-GPC bandwidth instead. I’ve not seen an improvement going from 4x to 8x. (Note that cluster size 4 or 8 can no longer use all SMs)

However, this is throughput/bandwidth, not latency. One thing that’s less intuitive is that multicast will have higher latency, especially average latency.
If you're not multicasting, the different CTAs don't necessarily have to be in sync with each other. So if they individually load the same block of data at slightly different times, one of them arrives at the load first and takes the latency penalty of any L2 cache miss. By the time the other CTAs load the same data, it may already be in L2, so their locally observed latency can be much lower. A single multicast load, on the other hand, takes all the same cache misses as the first non-multicast load. It's therefore important to pipeline your loads enough to cover the worst-case latency.
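To cover that worst-case latency, a simple multi-stage pipeline over smem buffers is enough (untested sketch; `TILE_BYTES`, `numTiles`, and the issue/consume steps are placeholders):

```cpp
// Untested sketch: 2-stage pipeline. The load for tile i+1 is in flight while
// tile i is being consumed, so a full L2/HBM miss latency is overlapped.
constexpr int STAGES = 2;
__shared__ alignas(128) char buf[STAGES][TILE_BYTES];
__shared__ alignas(8) uint64_t bars[STAGES];

for (int i = 0; i < numTiles + 1; ++i) {
    if (i < numTiles && threadIdx.x == 0) {
        int s = i % STAGES;
        cuda::ptx::mbarrier_arrive_expect_tx(cuda::ptx::sem_release, cuda::ptx::scope_cta,
                                             cuda::ptx::space_shared, &bars[s], TILE_BYTES);
        // ... issue the (possibly multicast) bulk-tensor copy for tile i into buf[s]
    }
    if (i > 0) {
        int s = (i - 1) % STAGES;
        // ... wait on bars[s] (mbarrier_try_wait with this stage's phase), then consume buf[s]
    }
}
```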


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.