How does the Thread Block Cluster of the Nvidia H100 work concurrently?

Does anyone know more detail about exactly how the H100 thread block clusters work?

How does the H100 make a thread block run on multiple SMs? Can thread blocks in the same cluster only run on a specific group of SMs? Will there be any overlap in the use of SMs?

And does doing this bring any benefits for dynamic programming or other use cases?

I am just so confused after reading the H100 white paper.

Thank you guys XD

It doesn’t. A threadblock is still required to run on a particular SM. A threadblock cluster refers to a group of threadblocks that are cooperating. A single threadblock is still constrained to a single SM.

Yes. The H100 will have a hierarchical organization of SMs: SMs will be organized into groups. A threadblock cluster is constrained such that all threadblocks in the cluster must be resident on the SMs associated with a single group. My expectation is that the details here will be largely transparent to the programmer, and that it is mostly the job of the block scheduler to make this happen/work.

From here:

Clusters enable multiple thread blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.

From that same article, I would recommend reading the section entitled “Thread block clusters”. From that section:

A cluster is a group of thread blocks that are guaranteed to be concurrently scheduled onto a group of SMs, where the goal is to enable efficient cooperation of threads across multiple SMs. The clusters in H100 run concurrently across SMs within a GPC.

(emphasis added)
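To make the concurrent-scheduling guarantee concrete, here is a minimal sketch of what a cluster kernel looks like in CUDA 12 on sm_90 (Hopper). The kernel name and sizes are made up for illustration; `__cluster_dims__`, `cooperative_groups::this_cluster()`, and `cluster.sync()` are the actual CUDA C++ cluster APIs:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Ask the scheduler to group blocks into clusters of 2. All blocks of a
// cluster are guaranteed to be co-resident on the SMs of one GPC.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float *data)
{
    cg::cluster_group cluster = cg::this_cluster();

    // ... each block works on its own slice of data ...

    // Barrier across ALL threads of ALL blocks in the cluster --
    // something that was not possible across blocks before Hopper.
    cluster.sync();
}

// Launched like any other kernel; the grid size must be a multiple
// of the cluster size (here, 2 blocks):
// clusterKernel<<<8, 256>>>(d_data);
```

The co-scheduling guarantee is exactly why `cluster.sync()` is safe here: a barrier across blocks only works if all participating blocks are resident at the same time.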


Thank you so much for your reply!!

And I have another question:

So can an SM have multiple thread blocks running at the same time? Or is it still a one-to-one relationship?

SMs have always had the ability to run multiple threadblocks at the same time. There is no change to that concept.
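You can query this yourself with the occupancy API. A minimal sketch (the kernel and block size here are just placeholders) using the real `cudaOccupancyMaxActiveBlocksPerMultiprocessor` runtime call:

```cuda
#include <cstdio>

__global__ void myKernel(float *out) { /* ... */ }

int main()
{
    int blocksPerSM = 0;
    // How many blocks of myKernel can be resident on one SM at once,
    // for 256 threads per block and 0 bytes of dynamic shared memory?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  myKernel, 256, 0);
    printf("Resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

On most GPUs this prints a number well above 1, which is what keeps the SM busy when one block stalls on memory.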


I see.

Thank you again!

Hi, I have another question.
In the h100 whitepaper, it says:

A dedicated SM-to-SM network for SMs in a GPC provides fast data sharing between threads in a Cluster.

Does this mean that threads in different blocks/SMs (as long as they are in the same cluster) can share data?

Thank you soooo much!
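Yes, that is the "distributed shared memory" feature: within a cluster, a block can read and write another block's shared memory over the SM-to-SM network. A minimal sketch (kernel name and sizes are made up), using the real `cluster_group::map_shared_rank` API from CUDA 12:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) dsmemKernel(int *out)
{
    __shared__ int smem[256];
    cg::cluster_group cluster = cg::this_cluster();

    smem[threadIdx.x] = threadIdx.x + 1000 * cluster.block_rank();
    cluster.sync();  // make every block's shared memory ready to read

    // Map the OTHER block's shared memory into this block's address space.
    unsigned peer = cluster.block_rank() ^ 1;
    int *peer_smem = cluster.map_shared_rank(smem, peer);
    out[cluster.block_rank() * blockDim.x + threadIdx.x]
        = peer_smem[threadIdx.x];

    cluster.sync();  // don't exit while a peer may still be reading us
}
```

Note the second `cluster.sync()`: a block's shared memory must stay alive until every other block in the cluster is done reading it.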



Thank you!

Hi, sorry to bother you again XD
I have another question now.
Will the group of SMs assigned to one cluster overlap with the group of SMs assigned to another cluster?
And how about the lifetime of a cluster and its SMs?

Thank you so much!

The lifetime of a threadblock cluster will be dictated by the lifetime of its threadblocks. If you have no actual experience with CUDA programming, I’m not sure this terminology is going to be helpful or useful for you. SMs are not ephemeral - they are hardware entities. They have a lifetime that is as long as you own your GPU.

Since the primary "new thing" being discussed in the context of Hopper is the threadblock cluster, my expectation is that interaction between clusters is much like interaction between threadblocks today (pre-Hopper): there are no explicit mechanisms today for threadblock-to-threadblock communication (other than what you construct yourself via global memory), and there is no formal definition of the hardware/software mapping of threadblocks to SMs other than that a threadblock lives out its life on a single SM. Analogously, I expect that a cluster lives out its life on a GPC.
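For reference, "construct yourself via global memory" typically means atomics and memory fences. A sketch of the classic last-block-standing pattern (names and the placeholder work are made up; `atomicInc` and `__threadfence` are the real primitives), where the final block to finish combines everyone's partial results:

```cuda
__global__ void lastBlockReduces(float *partial, float *result,
                                 unsigned int *doneCount)
{
    // Each block writes its partial result to global memory.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = blockIdx.x * 1.0f;  // placeholder work
    __threadfence();  // make the write visible to other blocks

    // The block that increments the counter last does the final sum.
    __shared__ bool isLast;
    if (threadIdx.x == 0)
        isLast = (atomicInc(doneCount, gridDim.x) == gridDim.x - 1);
    __syncthreads();

    if (isLast && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned i = 0; i < gridDim.x; ++i)
            sum += partial[i];
        *result = sum;
    }
}
```

This works without any scheduling guarantee precisely because it only relies on global memory; clusters add guaranteed co-residency on top, which is what makes true barriers and shared-memory access between blocks possible.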

Again, without any experience in CUDA programming, this may not make much sense.

I see! Thank you so much.
And I indeed have no experience in CUDA programming; I'll try it now.

You can get an orderly introduction to CUDA programming using the tutorial series here

wow, that is helpful!!
Thank you again!