How do the thread block clusters of the NVIDIA H100 work concurrently?

Does anyone know more detail about exactly how the H100 thread block clusters work?

How does the H100 make a thread block run on multiple SMs? Can thread blocks in the same cluster only run on a specific group of SMs? Will there be any overlap in the use of SMs?

And does this bring any benefit for dynamic programming or other use cases?

I am just so confused after reading the H100 whitepaper.

Thank you guys XD

It doesn’t. A threadblock is still required to run on a particular SM. A threadblock cluster refers to a group of threadblocks that are cooperating. A single threadblock is still constrained to a single SM.
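If you want to see the block-to-SM mapping for yourself, here is a minimal sketch that reads the %smid PTX special register (which reports the SM a thread is currently resident on):

```
#include <cstdio>

// Each block prints the ID of the SM it is resident on. All threads of
// a given block report the same %smid: a block runs entirely on one SM.
__global__ void whereAmI()
{
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));  // PTX special register
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, smid);
}

int main()
{
    whereAmI<<<8, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```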

Yes. The H100 will have a hierarchical organization of SMs. SMs will be organized into groups. A threadblock cluster is constrained such that all of its threadblocks must be resident on the SMs associated with a single group. My expectation is that the details here will be largely transparent to the programmer, and that it is mostly the job of the block scheduler to make this happen.

From here:

Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.

From that same article, I would recommend reading the section entitled “Thread block clusters”. From that section:

A cluster is a group of thread blocks that are guaranteed to be concurrently scheduled onto a group of SMs, where the goal is to enable efficient cooperation of threads across multiple SMs. The clusters in H100 run concurrently across SMs within a GPC.

(emphasis added)


Thank you so much for your reply!!

And I have another question:

So can an SM have multiple thread blocks running on it at the same time? Or is it still a one-to-one relationship?

SMs have always had the ability to run multiple threadblocks at the same time. There is no change to that concept.
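For example, you can ask the runtime how many blocks of a particular kernel can be co-resident on one SM via the occupancy API (a minimal sketch; the kernel and block size here are placeholders):

```
#include <cstdio>

__global__ void myKernel() { /* placeholder kernel */ }

int main()
{
    int blocksPerSM = 0;
    // How many 128-thread blocks of myKernel (no dynamic shared memory)
    // can be resident on a single SM at the same time?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, 128, 0);
    printf("up to %d blocks can be resident per SM\n", blocksPerSM);
    return 0;
}
```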


I see.

Thank you again!

Hi, I have another question.
In the H100 whitepaper, it says:

A dedicated SM-to-SM network for SMs in a GPC provides fast data sharing between threads in a Cluster.

Does this mean that threads in different blocks/SMs (but within the same cluster) can share data?

Thank you soooo much!

Yes. Threads in different threadblocks, and therefore potentially on different SMs, can share data, provided those threadblocks belong to the same cluster.
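The programming details haven’t been disclosed yet, but conceptually it might look something like this sketch (this_cluster(), map_shared_rank(), and the other names here are assumptions, not a published API):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// ASSUMED API: this_cluster() and map_shared_rank() are illustrative
// names, not a disclosed interface.
__global__ void shareAcrossBlocks(int *out)
{
    __shared__ int mine;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        mine = cluster.block_rank();  // each block tags its own smem
    cluster.sync();                   // every block in the cluster arrives

    // Map a pointer into the neighboring block's shared memory and read
    // it over the SM-to-SM network.
    unsigned nbr = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(&mine, nbr);
    if (threadIdx.x == 0)
        out[blockIdx.x] = *remote;

    cluster.sync();  // keep every block alive until remote reads finish
}
```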


Thank you!

Hi, sorry to bother you again XD
I have another question now.
Can the group of SMs associated with one cluster overlap with the group of SMs associated with another cluster?
And how about the lifetime of a cluster and its SMs?

Thank you so much!

The lifetime of a threadblock cluster will be dictated by the lifetime of its threadblocks. If you have no actual experience with CUDA programming, I’m not sure this terminology is going to be helpful or useful for you. SMs are not ephemeral - they are hardware entities. They have a lifetime that is as long as you own your GPU.

Since the primary “new thing” being discussed in the context of Hopper is the threadblock cluster, my expectation is that interaction between clusters is much like interaction between threadblocks today (pre-Hopper): there are no explicit mechanisms today for threadblock-to-threadblock communication (other than what you construct yourself via global memory), and there is no formal definition of the hardware/software mapping of threadblocks to SMs other than that a threadblock lives out its entire lifetime on a single SM. Analogously, I expect that a cluster lives out its entire lifetime on a GPC.
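As a concrete example of what “construct yourself via global memory” means, here is a minimal pre-Hopper sketch of one block handing a value to another through global memory (illustrative only; it assumes both blocks are co-resident, which holds for a 2-block grid on any modern GPU):

```
#include <cstdio>

// Block 0 produces a value; block 1 spins on a flag in global memory
// until the value is published. This kind of hand-rolled handshake is
// the only blockwise communication mechanism pre-Hopper.
__global__ void handoff(volatile int *data, volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *data = 42;
        __threadfence();       // make the data visible device-wide
        *flag = 1;             // then publish it
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (*flag == 0) {}  // spin until block 0 publishes
        printf("block 1 received %d from block 0\n", *data);
    }
}

int main()
{
    int *data, *flag;
    cudaMalloc(&data, sizeof(int));
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));
    handoff<<<2, 32>>>(data, flag);
    cudaDeviceSynchronize();
    return 0;
}
```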

Again, without any experience in CUDA programming, this may not make much sense.

I see! Thank you so much.
And I indeed have no experience with CUDA programming; I’ll try it now.

You can get an orderly introduction to CUDA programming using the tutorial series here

wow, that is helpful!!
Thank you again!

Hi Robert!
New question again. XD
I noticed that:

In CUDA, thread blocks in a grid can optionally be grouped at kernel launch into clusters as shown in Figure 11, and cluster capabilities can be leveraged from the CUDA cooperative_groups API.

Does this mean H100 implements the cluster structure at the software level, or the hardware level?
And can I define a cluster via CUDA? Or will it automatically assign some thread blocks to a cluster based on the work?

Thank you so much.
(BTW I’m asking for permission to use our server, and I’m ready to write some CUDA to help me better understand all of this.)

A cluster is a software concept.
A GPC is a hardware concept.

Yes, to take advantage of this, you will have to indicate a cluster definition. I don’t have exact details to discuss until a future CUDA toolkit arrives with support for H100. There is no concept of a cluster in the CUDA programming model as of CUDA 11.7.

No, I don’t expect any sort of automatic assignment. However, I’m not able to answer these questions with 100% clarity because the actual programming model for this has not been disclosed yet, AFAIK.
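Purely as a guess at the shape it might take, here is a hypothetical launch-side sketch (cudaLaunchKernelEx, cudaLaunchAttributeClusterDimension, and the other names are assumptions until a toolkit with H100 support arrives):

```
__global__ void clusterKernel(float *in) { /* ... */ }

// HYPOTHETICAL: launch 16 blocks grouped into clusters of 2 blocks each.
// (A compile-time kernel attribute for the cluster shape is another
// possibility.)
void launchWithCluster(float *in)
{
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(16);   // total blocks; divisible by cluster size
    config.blockDim = dim3(128);

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;    // 2 blocks per cluster in x
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;

    config.attrs    = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, clusterKernel, in);
}
```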

Thank you sooooooooo much!

Hi Robert,
Hope you’re well!
I’m now curious about how the SM-to-SM network works in H100.

I kind of understand from the whitepaper that each thread now has a virtual address that points into its block’s shared memory, so that when a thread requires data from a thread in another block of the same cluster, it can directly access the other SM’s shared memory.
Is my understanding right?

My question is how this “direct access” happens. What path does a load/store request travel (e.g., how many hops in the hardware)? I am so confused about the details.

Thank you so much!

I don’t have any information for that beyond what is published.

I see, thank you for your reply!