I’m researching safe (or at least safer) programming languages for NVIDIA GPUs, and I have some questions about reconciling the official PTX proxy-fence documentation with the paper “Mixed-proxy extensions for the NVIDIA PTX memory consistency model”.
The surface-level question: when synchronizing a generic-proxy write in one CTA with an async-proxy read in another CTA (in the same cluster), should the fence.proxy.async come before or after the cluster sync? The paper says the fence must come after the cluster sync, which leads to my deeper question: how do the PTX docs establish this requirement? (I need to understand why, not merely know what to do in a specific case, in order to model what safety means in a programming language.) The interaction with Blackwell tcgen05 is also unclear; that is Question 3 below.
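For concreteness, here is the pattern in question written out as a schematic PTX listing, in the litmus style of the paper’s figures (my own sketch, not taken from the paper or the docs; register declarations are elided, and mapa plus st.shared::cluster is just one way to realize the cross-CTA generic-proxy write):

```
// CTA 0, one thread:
mapa.shared::cluster.u32  %r1, %r0, 1;  // map %r0 into CTA 1's SMEM window
st.shared::cluster.u32    [%r1], %r2;   // generic-proxy write to CTA 1's SMEM
barrier.cluster.arrive;                 // cluster sync (release semantics by default)
barrier.cluster.wait;

// CTA 1, one thread:
barrier.cluster.arrive;
barrier.cluster.wait;                   // cluster sync (acquire semantics by default)
fence.proxy.async;                      // per the paper: after the sync, in the CTA
                                        // performing the async-proxy access
// ... async-proxy read of the same SMEM location ...
```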
Question 1/3: Figure 8e has the caption “When synchronizing threads are in different CTAs, the proxy fence must be inserted in the CTA containing the non-generic operation”. However, the code listing seems to contradict this: it shows the fence.proxy.constant instruction in thread 0 (with st.global, a generic-proxy op), not in thread 1 (with ld.const, a constant-proxy op). Is this an error in the paper?
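For reference, this is my paraphrase of the listing’s shape (reconstructed from my reading of the figure, not copied verbatim):

```
// Thread 0 (CTA 0):
st.global.u32 [x], %r0;      // generic-proxy write
fence.proxy.constant;        // <- the fence appears here, with the generic op...
// <synchronization with thread 1>

// Thread 1 (CTA 1):
// <synchronization with thread 0>
ld.const.u32 %r1, [x];       // ...not here, with the constant-proxy read
```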
Question 2/3: Section 5.4 states “The proxy fence must also be inserted in the same CTA where the non-generic access is taking place”. How is this requirement established by the PTX documentation? I presume that “proxy-preserved base causality order” is part of the story, since that is where the “same thread block” distinction for non-generic proxies is introduced. From there, I am lost as to how a proxy fence instruction can participate in causality order at all. If X is a generic-proxy write in CTA 0, Y is an async-proxy read in CTA 1, and Z is a fence.proxy.async, then the causality-order rules require that
“For some operation Z [proxy fence?], X [generic write] precedes Z in observation order, and Z precedes Y [async read] in proxy-preserved base causality order”
but observation order is defined only for reads, writes, and atomics, and the fence is none of these (or else my deduction that a fence.proxy.async is the “some operation Z” needed here is incorrect).
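Spelled out in relation form, the chain I am trying (and failing) to construct is:

```latex
% X = generic-proxy write (CTA 0), Y = async-proxy read (CTA 1),
% obs = observation order, pbco = proxy-preserved base causality order
X \xrightarrow{\ \mathrm{obs}\ } Z \xrightarrow{\ \mathrm{pbco}\ } Y
% If Z = fence.proxy.async, the first edge is ill-formed, because obs is
% defined only over reads, writes, and atomics.
```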
The rules are clear enough when a write/read pair is performed by the same proxy (with the same virtual address and in the same CTA). But if I have a write/read pair in different proxies, how do the PTX rules establish that adding a proxy fence makes the pattern safe when it would otherwise be unsafe?
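Concretely, the minimal pattern whose justification I cannot derive from the axioms is the single-CTA, cross-proxy version (a sketch under my reading of the docs’ examples):

```
// One thread, one CTA:
st.shared.u32  [%smem], %r0;   // generic-proxy write
fence.proxy.async;             // intended to order prior generic-proxy ops
                               // before subsequent async-proxy ops
// ... async-proxy read of [%smem] ...
```

Even in this simplest case I cannot identify which axiom lets the fence play the role of the “some operation Z” above; the cross-CTA case then adds the “same thread block” complication on top.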
Question 3/3: How does this interact with Blackwell tcgen05? If we write to SMEM via the generic proxy in CTA 0 and then read that SMEM using tcgen05.mma in CTA 1 (async proxy), we need to insert a fence.proxy.async somewhere along the path of
A. generic write to SMEM (CTA 0)
B. cluster sync
C. tcgen05.fence::after_thread_sync (CTA 1)
D. tcgen05.mma (CTA 1)
The paper’s requirements imply that the fence.proxy.async must appear after cluster sync B, in CTA 1. However, all non-tcgen05 instructions after B may execute in parallel with already-issued tcgen05 instructions, so in principle the fence could take effect “too late”. The tcgen05.fence::after_thread_sync instruction only implies ordering between D and instructions issued prior to B (a cluster sync is a thread sync), so we would need to insert the fence.proxy.async in CTA 0, prior to B, for it to be guaranteed to complete in time for the tcgen05 instruction. This implies that the fence.proxy.async must simultaneously be before B and after B. (Note that inserting proxy fences in both places does not resolve the dilemma, because the same logic would imply that neither is effective.) This is perhaps a somewhat unrealistic use case (writing to another CTA’s SMEM without using TMA), but the purpose of the example is to illuminate the proxy-fence requirements, not to implement a specific algorithm.
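To make the dilemma concrete, here is the path A–D as a schematic listing with both candidate fence placements marked (my own sketch; tcgen05.mma operands are elided, and st.shared::cluster is just one way to realize A):

```
// CTA 0:
st.shared::cluster.u32 [%raddr], %r0;    // A: generic-proxy write to CTA 1's SMEM
// fence.proxy.async;                    // candidate placement 1: before B, in CTA 0
barrier.cluster.arrive;                  // B: cluster sync
barrier.cluster.wait;

// CTA 1:
barrier.cluster.arrive;                  // B: cluster sync
barrier.cluster.wait;
// fence.proxy.async;                    // candidate placement 2: after B, in CTA 1
                                         //   (the placement the paper requires)
tcgen05.fence::after_thread_sync;        // C: orders D after ops issued prior to B
tcgen05.mma.cta_group::1.kind::f16 ...;  // D: async-proxy read of the SMEM operand
```

Placement 2 satisfies the paper’s rule but, by the reasoning above, is not ordered before D by tcgen05.fence::after_thread_sync; placement 1 would be ordered before D but violates the paper’s rule.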