I would like to know whether the newer RTX 50xx GPUs will have hardware support for faster synchronization across thread blocks in the same cluster (the thread block cluster feature introduced with Hopper GPUs). I assume the answer is yes, since Blackwell supersedes Hopper, but to the best of my knowledge no RTX GPU has had that capability so far.
Thanks
Compute capabilities beyond 9.0 are not documented yet. I would expect a future CUDA update to cover the new GPUs.
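Until that documentation exists, one way to settle the question on actual hardware is to ask the runtime directly. Here is a minimal sketch, assuming CUDA 11.8 or newer (where the `cudaDevAttrClusterLaunch` attribute is available) and that the GPU of interest is device 0. It only reports what the driver says about the installed card and does not predict anything about unreleased GPUs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0

    // Read the device name and compute capability.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Ask whether the device supports launching thread block clusters.
    int clusterLaunch = 0;
    err = cudaDeviceGetAttribute(&clusterLaunch, cudaDevAttrClusterLaunch, device);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("%s: compute capability %d.%d, cluster launch %s\n",
                prop.name, prop.major, prop.minor,
                clusterLaunch ? "supported" : "not supported");
    return 0;
}
```

Build it with something like `nvcc query_cluster.cu -o query_cluster` (the file name is just a placeholder).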
A good hint about NVIDIA's recent plans is the difference between sm_90 and sm_90a: the sm_90a features look like a Hopper one-off, or at least something meant only for datacenter GPUs, whereas the sm_90 features are general ones intended to carry over to newer generations.
Also see
Cluster size of 8 is forward compatible starting compute capability 9.0
and
The maximum portable cluster size supported is 8; however, NVIDIA Hopper H100 GPU allows for a nonportable cluster size of 16 by opting in.
But it could be that the RTX 5000 series either gets a compute capability below 9.0, or that the statements about forward compatibility are changed to 10.0 only and the RTX 5000 series is 10.5.
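For anyone who wants to test the portable versus non-portable cluster sizes on a given card, here is a rough sketch. The kernel `clusterRankKernel`, the grid of 16 blocks, the 128 threads per block, and the cluster size of 2 are all illustrative choices of mine, not something taken from the documentation quoted above. It assumes a build targeting sm_90 or newer (for example `nvcc -arch=sm_90`), and on a GPU without cluster support the first query is expected to fail and the program exits with the error message.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical demo kernel: each block records its rank within its cluster.
__global__ void clusterRankKernel(unsigned int *ranks) {
    cg::cluster_group cluster = cg::this_cluster();
    cluster.sync();  // barrier across all thread blocks of this cluster
    if (threadIdx.x == 0) {
        ranks[blockIdx.x] = cluster.block_rank();
    }
}

// Print the failing call and bail out of main on any CUDA error.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t e_ = (call);                                           \
        if (e_ != cudaSuccess) {                                           \
            std::printf("%s failed: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                      \
        }                                                                  \
    } while (0)

int main() {
    const int numBlocks = 16;        // illustrative; must be a multiple of the cluster size
    const int threadsPerBlock = 128; // illustrative

    cudaLaunchConfig_t config = {};
    config.gridDim = dim3(numBlocks);
    config.blockDim = dim3(threadsPerBlock);
    config.dynamicSmemBytes = 0;

    // Largest cluster size the runtime reports without opting in to non-portable sizes.
    int portableMax = 0;
    CHECK(cudaOccupancyMaxPotentialClusterSize(&portableMax, clusterRankKernel, &config));

    // Opt in to non-portable cluster sizes for this kernel, then query again.
    CHECK(cudaFuncSetAttribute(clusterRankKernel,
                               cudaFuncAttributeNonPortableClusterSizeAllowed, 1));
    int nonPortableMax = 0;
    CHECK(cudaOccupancyMaxPotentialClusterSize(&nonPortableMax, clusterRankKernel, &config));

    std::printf("max cluster size: %d (portable), %d (non-portable opt-in)\n",
                portableMax, nonPortableMax);

    // Launch with a small cluster of 2 blocks along x.
    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeClusterDimension;
    attr[0].val.clusterDim.x = 2;
    attr[0].val.clusterDim.y = 1;
    attr[0].val.clusterDim.z = 1;
    config.attrs = attr;
    config.numAttrs = 1;

    unsigned int *ranks = nullptr;
    CHECK(cudaMalloc(&ranks, numBlocks * sizeof(unsigned int)));
    CHECK(cudaLaunchKernelEx(&config, clusterRankKernel, ranks));
    CHECK(cudaDeviceSynchronize());

    unsigned int host[numBlocks];
    CHECK(cudaMemcpy(host, ranks, sizeof(host), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(ranks));

    for (int i = 0; i < numBlocks; ++i) {
        std::printf("block %d -> rank %u in its cluster\n", i, host[i]);
    }
    return 0;
}
```

Comparing the two reported maxima on an H100 against whatever an RTX 50xx card reports would show directly whether the consumer parts expose the same cluster limits.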