How does the Thread Block Cluster of the Nvidia H100 work concurrently?

Does anyone know in more detail exactly how the H100 thread block clusters work?

How does the H100 make a thread block run across multiple SMs? Can thread blocks in the same cluster only run on a specific group of SMs? Will there be any overlap in the use of SMs?

And does doing this bring any benefit for dynamic programming or other workloads?

I am just so confused after reading the H100 whitepaper.

Thank you guys XD

It doesn’t. A threadblock is still required to run on a particular SM. A threadblock cluster refers to a group of threadblocks that are cooperating. A single threadblock is still constrained to a single SM.

Yes. The H100 will have a hierarchical organization of SMs. SMs will be organized into groups. A threadblock cluster is constrained such that all of its threadblocks must be resident on the SMs associated with a single group. My expectation is that the details here will be largely transparent to the programmer, and that it is mostly the job of the block scheduler to make this happen.

From here:

Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.

From that same article, I would recommend reading the section entitled “Thread block clusters”. From that section:

A cluster is a group of thread blocks that are guaranteed to be concurrently scheduled onto a group of SMs, where the goal is to enable efficient cooperation of threads across multiple SMs. The clusters in H100 run concurrently across SMs within a GPC.

(emphasis added)
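For illustration, this co-scheduling guarantee is what makes a cluster-wide barrier legal. Here is a minimal sketch assuming the cluster syntax that eventually shipped in CUDA 12 (`__cluster_dims__` and the cooperative groups `this_cluster()` API) — not available at the time of this thread, so treat the names as provisional:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Compile-time cluster shape: every 4 consecutive blocks of the grid
// form one cluster, co-scheduled on the SMs of a single GPC.
// (Requires sm_90 and CUDA 12+; sketch only.)
__global__ void __cluster_dims__(4, 1, 1) cluster_kernel(float *out)
{
    cg::cluster_group cluster = cg::this_cluster();

    // Because all blocks in this cluster are guaranteed to be resident
    // at the same time, a cluster-wide barrier is legal:
    cluster.sync();

    out[blockIdx.x * blockDim.x + threadIdx.x] = cluster.block_rank();
}
```

Such a barrier would deadlock for ordinary thread blocks, since pre-Hopper CUDA never guarantees two blocks are resident simultaneously.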


Thank you so much for your reply!!

And I have another question:

So can an SM have multiple thread blocks running at the same time? Or is it still a one-to-one relationship?

SMs have always had the ability to run multiple threadblocks at the same time. There is no change to that concept.
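As a concrete example, the long-standing occupancy API lets you ask the runtime how many copies of a kernel fit on one SM simultaneously. A minimal sketch (hypothetical kernel name; the reported count depends on your GPU and the kernel's resource usage):

```cuda
#include <cstdio>

__global__ void small_kernel() { }

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many instances of this kernel (128 threads
    // per block, no dynamic shared memory) can be resident on one SM
    // at the same time.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, small_kernel, 128, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```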


I see.

Thank you again!

Hi, I have another question.
In the h100 whitepaper, it says:

A dedicated SM-to-SM network for SMs in a GPC provides fast data sharing between threads in a Cluster.

Does this mean that threads in different blocks/SMs (but should be in the same cluster) can share data?

Thank you soooo much!

A dedicated SM-to-SM network for SMs in a GPC provides fast data sharing between threads in a Cluster.

(emphasis added)
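A hedged sketch of what that cross-block sharing looks like with the cooperative groups API that later shipped in CUDA 12 — `map_shared_rank` returns a pointer into another block's shared memory, reached over the SM-to-SM network. The names come from NVIDIA's published Hopper material, not this thread:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Launch with 32 threads per block; each pair of blocks forms a cluster.
// (Requires sm_90 and CUDA 12+; sketch only.)
__global__ void __cluster_dims__(2, 1, 1) exchange(int *out)
{
    __shared__ int buf[32];
    cg::cluster_group cluster = cg::this_cluster();

    buf[threadIdx.x] = blockIdx.x;   // write my own block's shared memory
    cluster.sync();                  // make all blocks' writes visible

    // Map the partner block's shared-memory buffer into my address
    // space and read from it across the SM-to-SM network.
    unsigned peer = cluster.block_rank() ^ 1;
    int *peer_buf = cluster.map_shared_rank(buf, peer);
    out[cluster.block_rank() * 32 + threadIdx.x] = peer_buf[threadIdx.x];
}
```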


Thank you!

Hi, sorry to bother you again XD
I have another question now.
Will the group of SMs associated with one cluster overlap with the group of SMs associated with another cluster?
And how about the lifetime of a cluster and its SMs?

Thank you so much!

The lifetime of a threadblock cluster will be dictated by the lifetime of its threadblocks. If you have no actual experience with CUDA programming, I’m not sure this terminology is going to be helpful or useful for you. SMs are not ephemeral - they are hardware entities. They have a lifetime that is as long as you own your GPU.

Since the primary “new thing” being discussed in the context of Hopper is the threadblock cluster, my expectation is that interaction between clusters is much like interaction between threadblocks today (pre-Hopper): there are no explicit mechanisms today for threadblock-to-threadblock communication (other than what you construct yourself via global memory), and there is no formal definition of the hardware/software mapping of threadblocks to SMs other than that a threadblock lives out its entire life on a single SM. Analogously, I expect that a cluster lives out its entire life on a single GPC.

Again, without any experience in CUDA programming, this may not make much sense.
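To make the pre-Hopper situation concrete, the only block-to-block channel you can construct yourself is global memory. A sketch with a hypothetical producer/consumer pair (the spin-wait is only safe if both blocks are actually co-resident, which CUDA does not guarantee for an arbitrary launch — this is exactly the guarantee clusters add):

```cuda
#include <cstdio>

__device__ int flag    = 0;   // signal cell in global memory
__device__ int mailbox = 0;   // payload cell in global memory

__global__ void producer_consumer()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        mailbox = 42;            // write the payload first
        __threadfence();         // make it visible device-wide
        atomicExch(&flag, 1);    // then raise the flag
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Spin until block 0 raises the flag (deadlocks if block 0
        // is not resident at the same time as this block).
        while (atomicAdd(&flag, 0) == 0) { }
        printf("got %d\n", mailbox);
    }
}
```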

I see! Thank you so much.
And I indeed have no experience in CUDA programming; I'll try it now.

You can get an orderly introduction to CUDA programming using the tutorial series here

wow, that is helpful!!
Thank you again!

Hi Robert!
New question again. XD
I noticed that:

In CUDA, thread blocks in a grid can optionally be grouped at kernel launch into clusters as shown in Figure 11, and cluster capabilities can be leveraged from the CUDA cooperative_groups API.

Does this mean H100 implements the cluster structure at the software level, or the hardware level?
And can I define a cluster myself in CUDA? Or will it automatically assign some thread blocks to a cluster based on the work?

Thank you so much.
(BTW, I'm asking for permission to use our server, and I'm ready to write some CUDA to help me better understand these.)

A cluster is a software concept. A GPC is a hardware concept.

Yes, to take advantage of this, you will have to indicate a cluster definition. I don’t have exact details to discuss until a future CUDA toolkit arrives with support for H100. There is no concept of a cluster in the CUDA programming model as of CUDA 11.7.

No, I don’t expect any sort of automatic assignment. However I’m not able to answer these questions with 100% clarity because the actual programming model for this has not been disclosed yet. AFAIK.
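For reference, a runtime-side cluster definition did eventually appear in a later toolkit. A sketch using the `cudaLaunchKernelEx` attribute mechanism from CUDA 12 (not available in CUDA 11.7, so treat the exact names as provisional here; `my_kernel` is a placeholder):

```cuda
__global__ void my_kernel() { }

void launch_with_clusters()
{
    cudaLaunchConfig_t cfg = {0};
    cfg.gridDim  = dim3(8, 1, 1);    // 8 blocks total
    cfg.blockDim = dim3(128, 1, 1);

    // Attach a cluster-dimension attribute: every 2 consecutive
    // blocks of the grid form one cluster.
    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, my_kernel);
}
```

The alternative is the compile-time `__cluster_dims__` kernel attribute, which fixes the cluster shape at compile time rather than per launch.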

Thank you sooooooooo much!

Hi Robert,
Hope you’re well!
I’m now curious about how the SM-to-SM network works in H100.

I roughly understand from the whitepaper that each block's shared memory now has a virtual address, so that when a thread requires data from a thread in another block of the same cluster, it can directly access the other SM's shared memory.
Is my understanding right?

My question is how this “direct access” happens. What path does a load/store/… request travel (e.g., how many hops in the hardware)? I am confused about the details.

Thank you so much!

I don’t have any information for that beyond what is published.

I see, thank you for your reply!