Rubin SM Rumours

This article from January says the Rubin SM will have at least 2x the SFU (special function unit) throughput for exponentiation compared to Blackwell on FP32, and 4x on FP16 types; a separate FP16 rate is new for the SFU (though FP16 variants were already exposed in PTX). Norbert wondered in a forum post (linked at the end of this post) whether any such change would require a different number of ROM read ports.

That could be relevant for general CUDA programming.

This is probably already partly implemented in Blackwell Ultra, where the overall MUFU.EX2 throughput was doubled, but is the same for FP32 and FP16.

The focus of the articles is on exponentiation, as it is part of the softmax formula, which is used to combine several neural-network output values of widely different scales, especially when they signify probabilities. Before the exponentiation, the raw values from the network are the logarithm of the final output and so are precise at any scale; after it, the outputs are normalized to sum to 1.

See also

a previous discussion here in the forum about MUFU.EX2.

Are there any other known architectural improvements in the Rubin SM that would likely be useful for general compute (rather than AI training or inference) or be available in non-data-center GPUs?

With today’s transistor counts (~330 billion in Rubin, from what I gather), the additional hardware expenditure necessary to double the throughput of MUFU.EX2 likely disappears in the rounding errors when accounting for silicon real estate.

Since my 2024 post referenced above, I have studied the MUFU.EX2 implementation based on published literature and built a bit-accurate emulation of it, mostly because I was interested in the curious imbalance in computational error between positive and negative arguments; I successfully identified the root cause of that.

As it turned out, the MUFU design cuts its components down to the absolute minimum needed, optimizing for both latency and transistor count.
