I’m researching safe (or at least safer) programming languages for NVIDIA GPUs, and I have some questions about reconciling the official PTX proxy-fence documentation with the paper “Mixed-proxy extensions for the NVIDIA PTX memory consistency model”.
The surface-level question: when synchronizing a generic-proxy write in one CTA with an async-proxy read in another CTA (in the same cluster), should the fence.proxy.async come before or after the cluster sync? The paper says the fence must come after the cluster sync, which leads to my deeper question: how do the PTX docs establish this requirement? (I need to understand why, not merely know what to do in a specific case, in order to model what safety means in a programming language.) The interaction with Blackwell tcgen05 is also unclear; that is Question 3 below.
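For concreteness, here is the pattern in question written out as a schematic PTX listing, in the litmus style of the paper’s figures (my own sketch, not taken from the paper or the docs; register declarations are elided, and mapa plus st.shared::cluster is just one way to realize the cross-CTA generic-proxy write):

```
// CTA 0, one thread:
mapa.shared::cluster.u32  %r1, %r0, 1;  // map %r0 into CTA 1's SMEM window
st.shared::cluster.u32    [%r1], %r2;   // generic-proxy write to CTA 1's SMEM
barrier.cluster.arrive;                 // cluster sync (release semantics by default)
barrier.cluster.wait;

// CTA 1, one thread:
barrier.cluster.arrive;
barrier.cluster.wait;                   // cluster sync (acquire semantics by default)
fence.proxy.async;                      // per the paper: after the sync, in the CTA
                                        // performing the async-proxy access
// ... async-proxy read of the same SMEM location ...
```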
Question 1/3: Figure 8e has the caption “When synchronizing threads are in different CTAs, the proxy fence must be inserted in the CTA containing the non-generic operation”. However, the code listing seems to contradict this: it shows the fence.proxy.constant instruction in thread 0 (with st.global, a generic-proxy op), not in thread 1 (with ld.const, a constant-proxy op). Is this an error in the paper?
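For reference, this is my paraphrase of the listing’s shape (reconstructed from my reading of the figure, not copied verbatim):

```
// Thread 0 (CTA 0):
st.global.u32 [x], %r0;      // generic-proxy write
fence.proxy.constant;        // <- the fence appears here, with the generic op...
// <synchronization with thread 1>

// Thread 1 (CTA 1):
// <synchronization with thread 0>
ld.const.u32 %r1, [x];       // ...not here, with the constant-proxy read
```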
Question 2/3: Section 5.4 states “The proxy fence must also be inserted in the same CTA where the non-generic access is taking place”. How is this requirement established by the PTX documentation? I presume that “proxy-preserved base causality order” is part of the story, since that is where the “same thread block” distinction for non-generic proxies is introduced. From there, I am lost as to how a proxy fence instruction can participate in causality order at all. If X is a generic-proxy write in CTA 0, Y is an async-proxy read in CTA 1, and Z is a fence.proxy.async, then the causality-order rules require that
“For some operation Z [proxy fence?], X [generic write] precedes Z in observation order, and Z precedes Y [async read] in proxy-preserved base causality order”
but observation order is defined only for reads, writes, and atomics, and the fence is none of these (or else my deduction that a fence.proxy.async is the “some operation Z” needed here is incorrect).
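Spelled out in relation form, the chain I am trying (and failing) to construct is:

```latex
% X = generic-proxy write (CTA 0), Y = async-proxy read (CTA 1),
% obs = observation order, pbco = proxy-preserved base causality order
X \xrightarrow{\ \mathrm{obs}\ } Z \xrightarrow{\ \mathrm{pbco}\ } Y
% If Z = fence.proxy.async, the first edge is ill-formed, because obs is
% defined only over reads, writes, and atomics.
```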
The rules are clear enough when a write/read pair is performed by the same proxy (with the same virtual address and in the same CTA). But if I have a write/read pair in different proxies, how do the PTX rules establish that adding a proxy fence makes the pattern safe when it would otherwise be unsafe?
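Concretely, the minimal pattern whose justification I cannot derive from the axioms is the single-CTA, cross-proxy version (a sketch under my reading of the docs’ examples):

```
// One thread, one CTA:
st.shared.u32  [%smem], %r0;   // generic-proxy write
fence.proxy.async;             // intended to order prior generic-proxy ops
                               // before subsequent async-proxy ops
// ... async-proxy read of [%smem] ...
```

Even in this simplest case I cannot identify which axiom lets the fence play the role of the “some operation Z” above; the cross-CTA case then adds the “same thread block” complication on top.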
Question 3/3: How does this interact with Blackwell tcgen05? If we write to SMEM via the generic proxy in CTA 0 and then read that SMEM using tcgen05.mma in CTA 1 (async proxy), we need to insert a fence.proxy.async somewhere along the path of
A. generic write to SMEM (CTA 0)
B. cluster sync
C. tcgen05.fence::after_thread_sync (CTA 1)
D. tcgen05.mma (CTA 1)
The paper’s requirements imply that the fence.proxy.async must appear after cluster sync B, in CTA 1. However, all non-tcgen05 instructions after B may execute in parallel with already-issued tcgen05 instructions, so in principle the fence could take effect “too late”. The tcgen05.fence::after_thread_sync instruction only implies ordering between D and instructions issued prior to B (a cluster sync is a thread sync), so we would need to insert the fence.proxy.async in CTA 0, prior to B, for it to be guaranteed to complete in time for the tcgen05 instruction. This implies that the fence.proxy.async must simultaneously be before B and after B. (Note that inserting proxy fences in both places does not resolve the dilemma, because the same logic would imply that neither is effective.) This is perhaps a somewhat unrealistic use case (writing to another CTA’s SMEM without using TMA), but the purpose of the example is to illuminate the proxy-fence requirements, not to implement a specific algorithm.
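To make the dilemma concrete, here is the path A–D as a schematic listing with both candidate fence placements marked (my own sketch; tcgen05.mma operands are elided, and st.shared::cluster is just one way to realize A):

```
// CTA 0:
st.shared::cluster.u32 [%raddr], %r0;    // A: generic-proxy write to CTA 1's SMEM
// fence.proxy.async;                    // candidate placement 1: before B, in CTA 0
barrier.cluster.arrive;                  // B: cluster sync
barrier.cluster.wait;

// CTA 1:
barrier.cluster.arrive;                  // B: cluster sync
barrier.cluster.wait;
// fence.proxy.async;                    // candidate placement 2: after B, in CTA 1
                                         //   (the placement the paper requires)
tcgen05.fence::after_thread_sync;        // C: orders D after ops issued prior to B
tcgen05.mma.cta_group::1.kind::f16 ...;  // D: async-proxy read of the SMEM operand
```

Placement 2 satisfies the paper’s rule but, by the reasoning above, is not ordered before D by tcgen05.fence::after_thread_sync; placement 1 would be ordered before D but violates the paper’s rule.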