I would like to know whether the newer RTX 50xx GPUs have hardware for faster synchronization across thread blocks in the same SM cluster (the thread block cluster feature introduced with the Hopper GPUs). I assume the answer is yes, since Blackwell supersedes Hopper, but to the best of my knowledge we haven’t had any RTX GPUs with that capability yet.
A good hint about Nvidia’s recent plans is the difference between sm_90 and sm_90a: the sm_90a features are either a Hopper one-off, or at least meant only for datacenter GPUs, while sm_90 covers the general features carried forward to newer generations.
Also see
Cluster size of 8 is forward compatible starting compute capability 9.0
and
The maximum portable cluster size supported is 8; however, NVIDIA Hopper H100 GPU allows for a nonportable cluster size of 16 by opting in.
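For reference, the opt-in that quote refers to is done per kernel. A minimal sketch, assuming an H100 (cc 9.0) and CUDA 12.x; untested here, and the grid/block dimensions are just placeholder values:

```cuda
#include <cstdio>
#include <cooperative_groups.h>

__global__ void cluster_kernel() {
    namespace cg = cooperative_groups;
    cg::cluster_group cluster = cg::this_cluster();
    cluster.sync();  // synchronize all thread blocks in the cluster
}

int main() {
    // Opt in to non-portable cluster sizes (e.g. 16 on H100).
    cudaFuncSetAttribute(cluster_kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(32, 1, 1);   // must be a multiple of the cluster size
    cfg.blockDim = dim3(128, 1, 1);

    cudaLaunchAttribute attr;
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 16;      // > 8 is the non-portable, opt-in part
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    cudaError_t err = cudaLaunchKernelEx(&cfg, cluster_kernel);
    printf("launch: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Without the cudaFuncSetAttribute call, a cluster dimension above 8 should fail at launch even on H100.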
But it could be that either the RTX 5000 series gets a compute capability below 9.0, or that the sections about forward compatibility are changed to 10.0 only and RTX 5000 is 10.5.
That would be the largest jump in this technical version number since Maxwell (3.7 → 5.0). And even if true, would Nvidia have called it Blackwell as well? Perhaps they mixed it up with the required CUDA Toolkit version? Other third-party pages specify 10.1.
Here is a bit of discussion about the consumer Blackwells, also wondering about 12.8: Reddit - Dive into anything
It’s the CUDA toolkit (CTK) version. Remember that 11.8 was the first to provide some support for cc 8.9/cc 9.0. I’m guessing that 12.8 will be the first CTK version to provide some support for the “new” GPUs.
Yes, at some level they are unrelated. For example, compute capability (cc) 5.0 has no particular connection to CTK version 5.0. However, a given CTK version provides formal support for a range of ccs/architectures: there is an “oldest” architecture/cc it supports and a “newest” architecture/cc it supports. So it would be fair to say that the current CUDA 12.6 provides no “formal” support for a GPU cc higher than 9.0 (or 9.0a, if you prefer).
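The two numbers can be seen side by side at runtime. A small sketch (host-only code, assuming any CUDA-capable device at index 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // CUDART_VERSION is the toolkit version the code was built against,
    // e.g. 12060 for CTK 12.6 -- unrelated to the device's compute capability.
    printf("built with CTK %d.%d\n",
           CUDART_VERSION / 1000, (CUDART_VERSION % 1000) / 10);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        // prop.major/prop.minor is the compute capability of the GPU itself.
        printf("device 0: %s, cc %d.%d\n", prop.name, prop.major, prop.minor);
    }
    return 0;
}
```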
Nvidia might want to correct the graphic I linked to above then, as all the previous generations listed show the CC value, not the Toolkit release supporting them.
It is 10.0 for the datacenter Blackwell cards (e.g. B100, B200) and 12.0 for the consumer Blackwell cards (e.g. RTX 5090). There also seems to be a compute capability 10.1 card planned; it could be the Jetson embedded cards, the announced Digits AI workstation, or some datacenter refresh.
The consumer blackwells support clusters of 8 blocks.
For example, the RTX Blackwell whitepaper mentions that each SM in the GB202 GPU has 2 FP64 cores.
In the programming guide, the description of cc 12.0 mentions 2 FP64 cores as well.
Compute capabilities 10.0 and 10.1 are new in toolkit 12.7, and 12.0 is new in toolkit 12.8.
Then as @striker159 said, look at the programming guide
And you see, only 12.0 matches. In fairness, the programming guide does not describe 10.1. However, reading the PTX manual shows that 12.0 is the much more likely compute capability: e.g. 10.1 has a special high-performance tensor core with its own memory space, and you will find such a thing only on datacenter GPUs with much higher tensor core throughput.
It probably was only for internal testing, and perhaps for datacenter customers to use their Blackwell systems early on.
Some people have reported (also on this forum) the nvidia-smi tool showing 12.7 with the r565 driver a few weeks ago.
The RTX 5080, 5070 Ti and 5070 are most likely 12.0, too. 10.0 is in some ways more powerful than 12.0 and is reserved for the datacenter cards.
Yes, according to the documentation. Great news! Looking forward to RTX 50x0 thread block cluster benchmarks.
(One small difference: on consumer GPUs, Nvidia seems to have left out the thread block cluster multicast feature for copying from global memory to the shared memory of multiple SMs within one thread block cluster.)
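Before benchmarking, it may be worth checking what the consumer cards actually report. A hedged sketch using the standard device attribute for cluster launch support (host-only code; attribute values on RTX 50x0 are my assumption, not confirmed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int clusterLaunch = 0;
    // cudaDevAttrClusterLaunch reports whether the device supports
    // thread block cluster launch at all (expected 1 on cc >= 9.0 parts).
    cudaDeviceGetAttribute(&clusterLaunch, cudaDevAttrClusterLaunch, 0);
    printf("cluster launch supported: %d\n", clusterLaunch);

    // The maximum cluster size for a specific kernel can then be queried
    // with cudaOccupancyMaxPotentialClusterSize once a kernel is at hand.
    return 0;
}
```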
The whitepaper doesn’t show 2 FP64 cores on GB203 and GB205, so I assumed they would be cc 10.0. It does seem weird for NVIDIA to have different ccs for same-generation cards, though.