This article from January says Rubin SM will have 2x the SFU (special function unit) performance, at least for exponentiation, compared to Blackwell on FP32, and 4x the SFU performance on FP16 types; that FP16/FP32 difference is new at the SFU level (though it was already exposed in PTX). Norbert wondered in a forum thread (linked at the end of this post) whether any such change would require a different number of ROM read ports.
That could be relevant for general CUDA programming.
This is probably already partly implemented in Blackwell Ultra,
where the overall MUFU.EX2 throughput was doubled, but is identical for FP32 and FP16.
The articles focus on exponentiation because it is part of the softmax formula, which is important for combining several AI output values of widely different scales, especially when they represent probabilities. The raw values from the neural net (the logits) are effectively the logarithm of the final output, so they stay precise at any scale; exponentiating them and then normalizing makes the outputs sum to 1.
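To make that concrete, here is a minimal softmax sketch in NumPy (an illustration only, not tied to the SFU hardware path; on the GPU the `exp()` call would typically compile down to MUFU.EX2 via a base-2 exponential plus a scaling):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability: exp() of large logits
    # would otherwise overflow, while the result is unchanged because
    # softmax is invariant to a constant shift of all inputs.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    # Normalize so the outputs sum to 1 and can be read as probabilities.
    return exps / np.sum(exps)

# Logits of widely different scales are mapped to probabilities.
probs = softmax(np.array([1.0, 2.0, 10.0]))
print(probs, probs.sum())
```

The max-subtraction trick is the standard way to keep the exponentiation in range regardless of the scale of the raw network outputs.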
See also
a previous discussion here in the forum about MUFU.EX2.
Are there any other known architectural improvements in the Rubin SM that would likely be useful for general compute (rather than AI training or inference), or that will be available in non-data-center GPUs?