Thread block clustering in Blackwell GPUs

I would like to know whether the newer RTX 50xx GPUs will have hardware for faster synchronization across thread blocks within the same SM cluster (the feature introduced with Hopper GPUs). I assume the answer is yes, since Blackwell supersedes Hopper, but we haven’t had any RTX GPUs with that capability yet (to the best of my knowledge).
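For context, the kind of cross-block synchronization I mean looks roughly like this (a minimal sketch with a placeholder kernel; the cluster APIs are the ones documented for compute capability 9.0, compiled with e.g. -arch=sm_90):

```cpp
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

// Placeholder kernel: a compile-time cluster of 2 thread blocks (1D).
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int *data)
{
    cg::cluster_group cluster = cg::this_cluster();

    // Each block writes into its own slot in global memory...
    if (threadIdx.x == 0)
        data[cluster.block_rank()] = (int)cluster.block_rank();

    // ...then all blocks in the cluster synchronize with each other,
    // which is the cross-block hardware sync I am asking about.
    cluster.sync();
}

int main()
{
    int *d = nullptr;
    cudaMalloc(&d, 2 * sizeof(int));
    // The grid size must be a multiple of the cluster size (2 here).
    cluster_kernel<<<2, 128>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```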

Thanks

Compute capabilities beyond 9.0 are not documented yet. I would expect a future CUDA update that covers the new GPUs.

A good hint about Nvidia’s recent plans is the difference between sm_90 and sm_90a: the sm_90a features are a Hopper one-off, or at least meant only for datacenter GPUs, while sm_90 covers the general features carried forward to newer generations.

Also see

Cluster size of 8 is forward compatible starting compute capability 9.0

and

The maximum portable cluster size supported is 8; however, NVIDIA Hopper H100 GPU allows for a nonportable cluster size of 16 by opting in.

But it could also be that the RTX 5000 series gets a compute capability below 9.0, or that the sections about forward compatibility are changed to apply to 10.0 only and RTX 5000 ends up as 10.5.
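The opt-in mentioned in the second quote is done per kernel. A minimal sketch (assuming an H100 and a placeholder kernel; the attribute and launch-config APIs are the documented CUDA 12 runtime ones):

```cpp
#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder kernel */ }

int main()
{
    // Opt in to a non-portable cluster size (> 8) for this kernel (H100 only).
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 16;        // non-portable: only valid after the opt-in
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(16, 1, 1);     // grid must be a multiple of the cluster size
    cfg.blockDim = dim3(128, 1, 1);
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, my_kernel);

    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```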

Thanks, this is a good way to think about it!

Seems like they’ve allocated 12.8 for the RTX 50XX:
The Ultimate GeForce GPU Comparison 50 Series Specs

If I am not wrong, the CUDA version and the SM architecture can be unrelated.

That would be the largest jump since Maxwell (3.7 → 5.0) for this technical version number. And even if true, would Nvidia have called it Blackwell as well? Perhaps they mixed it up with the required CUDA Toolkit version? Other third-party pages specify 10.1.

Here is a bit of discussion about the consumer Blackwells also wondering about 12.8: Reddit - Dive into anything

It’s the CUDA toolkit (CTK) version. Remember 11.8 was the first to provide some support for cc8.9/cc9.0. I’m guessing that 12.8 will be the first CTK version to provide some support for the “new” GPUs.

Yes, at some level they are unrelated. For example, compute capability (cc) 5.0 has no particular connection to CTK version 5.0. However, a given CTK version provides formal support for a range of cc/architectures: there is an “oldest” architecture/cc it supports and a “newest” architecture/cc it supports. So it would be fair to say that the current CUDA 12.6 provides no “formal” support for a GPU cc higher than 9.0 (or 9.0a, if you prefer).
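You can see the two numbers side by side at runtime; a small sketch (standard runtime API, device 0 assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Compute capability of the GPU itself (e.g. 8.9, 9.0)...
    printf("GPU compute capability: %d.%d\n", prop.major, prop.minor);

    // ...versus the CUDA toolkit this program was built with
    // (CUDART_VERSION is e.g. 12060 for CTK 12.6).
    printf("Built against CUDA toolkit: %d\n", CUDART_VERSION);
    return 0;
}
```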

Nvidia might want to correct the graphic I linked to above then, as all the previous generations listed show the CC value, not the toolkit release that supports them.


For anyone seeing this later (after Jan 25th): the SM version for the Blackwell cards is 10.0 or higher, and they are going to support thread block clusters.

It is 10.0 for the datacenter Blackwell cards (e.g. B100, B200) and 12.0 for the consumer Blackwell cards (e.g. RTX 5090). There also seems to be a compute capability 10.1 card planned; it could be the Jetson embedded cards, the announced Digits AI Workstation, or some datacenter refresh.

The consumer Blackwells support clusters of 8 blocks.

Where does it say the RTX 50xx cards are compute capability 12.0? I only see 12.8 on the spec sheet, which, as discussed, seems to be the CTK version.

For example, the RTX Blackwell whitepaper mentions that each SM in the GB202 GPU has 2 FP64 cores.
In the programming guide, the description of cc 12.0 also mentions 2 FP64 cores.

For example, here you can see which compute capability versions are supported with which toolkit:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes

10.0 and 10.1 are new with toolkit 12.7, and 12.0 is new with toolkit 12.8.

Then as @striker159 said, look at the programming guide

And you see that only 12.0 matches. In fairness, the programming guide does not describe 10.1. However, reading the PTX manual shows that 12.0 is the much more likely compute capability. E.g. 10.1 has a special high-performance tensor core with its own memory space; you will find such a thing only on datacenter GPUs with much higher tensor core throughput.

I don’t see any toolkit 12.7 release here.

The 12.6.3 release doesn’t seem to mention cc higher than 9.x

Although I wouldn’t call it official documentation, it does state here that the current crop of RTX 50 series cards belongs to cc 12.0.

Blackwell B200 is defined as 10.0 here.

Understood! Makes sense now. Thanks for such sharp observations!

To summarize the RTX 50xx GPUs:

| GPU | GB Code | SM count | Core count | Compute Capability / SM Arch |
| --- | --- | --- | --- | --- |
| RTX 5090 | GB202 | 192 | 24,576 | 12.0 |
| RTX 5080 | GB203 | 84 | 10,752 | 10.0 (maybe?) |
| RTX 5070 Ti / 5070 | GB205 | 50 | 6,400 | 10.0 (maybe?) |

Also to reiterate, all of them will support thread block cluster based synchronization.
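Once the hardware is available, this can also be checked at runtime rather than from spec tables; a small sketch (assuming the CUDA 12 cluster occupancy API and a placeholder kernel):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummy_kernel() { /* placeholder kernel */ }

int main()
{
    // 1) Does the device support cluster launches at all?
    int clusterLaunch = 0;
    cudaDeviceGetAttribute(&clusterLaunch, cudaDevAttrClusterLaunch, 0);
    printf("Cluster launch supported: %d\n", clusterLaunch);

    // 2) Largest cluster size this kernel could be launched with on this device.
    cudaLaunchConfig_t cfg{};
    cfg.gridDim  = dim3(8, 1, 1);
    cfg.blockDim = dim3(128, 1, 1);
    int maxClusterSize = 0;
    cudaOccupancyMaxPotentialClusterSize(&maxClusterSize, dummy_kernel, &cfg);
    printf("Max potential cluster size: %d\n", maxClusterSize);
    return 0;
}
```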

The table in the PTX manual lists

| PTX ISA version | CUDA release, driver | Supported targets |
| --- | --- | --- |
| PTX ISA 8.6 | CUDA 12.7, driver r565 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}, sm_{100,100a}, sm_{101,101a} |
| PTX ISA 8.7 | CUDA 12.8, driver r570 | sm_{10,11,12,13}, sm_20, sm_{30,32,35,37}, sm_{50,52,53}, sm_{60,61,62}, sm_{70,72,75}, sm_{80,86,87,89}, sm_{90,90a}, sm_{100,100a}, sm_{101,101a}, sm_{120,120a} |

It probably was only for internal testing, and perhaps for datacenter customers to use their Blackwell systems early on? Some people have reported (also on this forum) that the nvidia-smi tool showed 12.7 with the r565 driver a few weeks ago.

RTX 5080, 5070 Ti, and 5070 are most likely 12.0, too. 10.0 is in some ways more powerful than 12.0 and is meant for the datacenter cards only.

Yes, according to the documentation. Great news! Looking forward to RTX 50x0 thread block cluster benchmarks.
(Small difference: on consumer GPUs, Nvidia seems to have left out the thread block cluster multicast feature for copying from global memory to the shared memory of multiple SMs within one thread block cluster.)
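Even without that multicast path, the base cluster feature still lets one block access another block’s shared memory directly (distributed shared memory); a rough sketch of that, with a placeholder kernel:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder kernel: two blocks per cluster exchange a value through
// distributed shared memory, without using the multicast copy feature.
__global__ void __cluster_dims__(2, 1, 1) dsmem_kernel(int *out)
{
    __shared__ int val;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        val = (int)cluster.block_rank();
    cluster.sync();                        // make 'val' visible cluster-wide

    // Map the neighbor block's shared memory into this block's address space.
    unsigned int neighbor = cluster.block_rank() ^ 1;
    int *remote = cluster.map_shared_rank(&val, neighbor);
    if (threadIdx.x == 0)
        out[cluster.block_rank()] = *remote;

    cluster.sync();                        // keep remote shared memory alive until all reads finish
}
```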


The whitepaper doesn’t show 2 FP64 cores on GB203 and GB205, so I assumed they would be CC 10.0. It does seem weird for NVIDIA to have different CCs for same-generation cards, though.