Rubin SM Rumours

This article from January says the Rubin SM will have at least 2x the SFU (special function unit) throughput for exponentiation compared to Blackwell on FP32, and 4x on FP16 types; a separate FP16 rate is new for the SFU (though FP16 variants were already exposed in PTX). Norbert wondered in a forum post (linked at the end of this post) whether any such change would require a different number of ROM read ports.

That could be relevant for general CUDA programming.

This is probably already partly implemented in Blackwell Ultra, where the overall MUFU.EX2 throughput was doubled, but is the same for FP32 and FP16.

The focus of the articles is on exponentiation, as it is part of the softmax formula, which is used to combine several neural-network output values of widely different scales, especially when they signify probabilities. Before the exponentiation, the raw values from the network are the logarithm of the final output and so are precise at any scale; after it, the outputs are normalized to sum to 1.

See also

a previous discussion here in the forum about MUFU.EX2.

Are there any other known architectural improvements in the Rubin SM that would likely be useful for general compute (rather than AI training or inference) or be available in non-data-center GPUs?

With today’s transistor counts (~330 billion in Rubin, from what I gather), the additional hardware expenditure necessary to double the throughput of MUFU.EX2 likely disappears in the rounding errors when accounting for silicon real estate.

Since my 2024 post referenced above, I have studied the MUFU.EX2 implementation based on published literature and built a bit-accurate emulation of it, mostly because I was interested in the curious imbalance in computational error between positive and negative arguments; I successfully identified the root cause of that.

As it turned out, the MUFU design cuts its components down to the absolute minimum needed, optimizing for both latency and transistor count.
