Nvidia and fp64?

A report from TechPowerUp (May 14, 2025) mentioned that AMD plans to release a card (MI430X UL4) optimized for HPC double-precision workloads in 2026.

From the article:

“AMD is gearing up to expand its Instinct MI family in the latter half of 2026 with two purpose‑built UDNA‑based accelerators. These new models, MI430X UL4 and MI450X, will cater respectively to high‑precision HPC tasks and large‑scale AI workloads. The MI430X UL4 is designed for applications that rely heavily on double‑precision FP64 floating‑point performance, such as scientific simulations, climate modeling, and others. It features a large array of FP64 tensor cores, …”

Any chance Nvidia might have similar plans? A few petaFLOPS per chip of fp4 is one thing, but a petaFLOPS per chip of fp64 would be something else.

The H100 SXM has 67 TFLOPS of Tensor FP64 + 34 TFLOPS of conventional FP64.

The AMD MI300X has 163.4 TFLOPS of matrix FP64 plus 81.7 TFLOPS of vector FP64.

Neither is at the petaFLOPS level for a single GPU or accelerator.
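
For a rough sense of the gap, here is a minimal back-of-the-envelope sketch (Python) using the peak figures quoted above; these are nominal vendor numbers, not sustained throughput:

    # Peak FP64 figures as quoted above (TFLOPS); nominal, not sustained.
    peak_fp64_tflops = {
        "H100 SXM (FP64 tensor)": 67,
        "H100 SXM (FP64 vector)": 34,
        "MI300X (FP64 matrix)": 163.4,
        "MI300X (FP64 vector)": 81.7,
    }

    for name, tflops in peak_fp64_tflops.items():
        # Distance from the 1 PFLOPS (= 1000 TFLOPS) mark discussed in this thread.
        print(f"{name}: {tflops/1000:.3f} PFLOPS, {1000/tflops:.1f}x short of 1 PFLOPS")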

Indeed! Hence that 2026 MI430X UL4 may be of interest for HPC.

But is there any indication it provides PFLOPS FP64 speed? Typically a new generation may double the speed.

(Nvidia even reduced the FP64 speed with Blackwell compared to Hopper)

And the UL4 is a system of 4 cards, not a single one. Nvidia also offers ways of combining multiple GPUs.

NVIDIA’s next-generation GPU architecture is called Rubin. It will be paired with a next-generation CPU, Vera. NVIDIA announced those for the 2nd half of 2026. A slide from Jensen Huang’s presentation says Rubin comprises two reticle-sized GPU dies with 50 PFLOPS of FP4 and 288 GB of HBM4. At the system level (rack scale) the projected performance is 3.6 EFLOPS FP4 and 1.2 EFLOPS FP8. I do not see any data on FP64 throughput, which is not very surprising given that > 90% of NVIDIA’s revenue comes from AI.
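
As a small sanity check on those quoted figures (assuming the per-package 50 PFLOPS FP4 and the 3.6 EFLOPS FP4 rack-scale numbers are directly comparable):

    # All values are quoted marketing figures from the Rubin announcement above.
    per_gpu_fp4_pflops = 50        # per Rubin package (two reticle-sized dies)
    rack_fp4_eflops = 3.6          # projected rack-scale FP4 throughput

    gpus_per_rack = rack_fp4_eflops * 1000 / per_gpu_fp4_pflops
    print(f"Implied GPU packages per rack: {gpus_per_rack:.0f}")   # -> 72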

Historical observation indicates that (1) practically achieved floating-point throughput as a percentage of theoretical FLOPS is usually significantly less with AMD GPUs compared to NVIDIA GPUs; (2) AMD has a systemic issue with lackluster software support going back some 25 years; (3) NVIDIA isn’t standing still and resting on its laurels even when it is ahead of the competition.

Even with the use of HBM4 memory I would expect most FP64-heavy codes to be limited by memory bandwidth.
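
To make that concrete, a rough roofline-style sketch; the peak and bandwidth numbers below are illustrative assumptions (roughly H100 SXM class), and the kernel is a generic STREAM-triad-like FP64 loop:

    # Illustrative roofline estimate; figures are assumptions, not measured values.
    peak_fp64 = 34e12        # FLOP/s, vector FP64 peak (roughly H100 SXM class)
    mem_bw = 3.35e12         # bytes/s, HBM bandwidth (roughly H100 SXM class)

    # FLOPs per byte the chip needs to stay compute-bound:
    machine_balance = peak_fp64 / mem_bw
    print(f"Machine balance: {machine_balance:.1f} FLOP/byte")

    # A triad-like FP64 kernel, a[i] = b[i] + s*c[i], does 2 FLOPs while moving
    # 3 doubles = 24 bytes, i.e. ~0.083 FLOP/byte -- far below the machine balance.
    kernel_intensity = 2 / 24
    attainable = min(peak_fp64, kernel_intensity * mem_bw)
    print(f"Bandwidth-limited ceiling: {attainable/1e12:.2f} TFLOPS "
          f"of {peak_fp64/1e12:.0f} TFLOPS peak")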

I’m not intending to directly address the question, but wish to present some possibly relevant information:

NVIDIA offers an FP64 emulation mode that could be used for experimentation with the HPL benchmark, as described here (25.04 HPC Benchmarks container).

Specifically excerpting:

NVIDIA HPL with FP64 emulation

Version 25.04 of the NVIDIA HPL benchmark supports FP64 emulation mode [1] on the NVIDIA Blackwell GPU architecture, using the techniques described in [2]. This is an opt-in feature, and the default mode remains the use of the native FP64 computations.

Environment variables to set up and control the NVIDIA HPL Benchmark FP64 emulation mode:

HPL_EMULATE_DOUBLE_PRECISION: Enables/disables FP64 emulation mode

  • Default Value: 0
  • Possible Values: 1 (enable), 0 (disable)

HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT: The maximum number of mantissa bits to be used for FP64 emulation [2] (includes IEEE FP64 standard’s implicit bit)

  • Default Value: 53
  • Possible Values: >0

Note:

  • The number of slices (INT8 data elements [2]) can be calculated as: nSlices = ceildiv((mantissaBitCount + 1), sizeofBits(INT8)), where the additional bit is used for the sign (+/-) of the value.
  • In the current iteration of the NVIDIA HPL benchmark, FP64 emulation utilizes INT8 data elements and compute resources. This may change in future releases.
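
Following the ceildiv() expression in the excerpt, a small sketch of how the mantissa-bit setting maps to INT8 slice counts; the environment-variable lines in the closing comment simply restate the variables documented above:

    def ceildiv(a, b):
        # Integer ceiling division, matching ceildiv() in the excerpt above.
        return -(-a // b)

    # nSlices = ceildiv(mantissaBitCount + 1, 8): one extra bit for the sign,
    # 8 bits per INT8 slice.
    for mantissa_bits in (53, 40, 32):      # 53 = full IEEE FP64 mantissa (incl. implicit bit)
        n_slices = ceildiv(mantissa_bits + 1, 8)
        print(f"{mantissa_bits:2d} mantissa bits -> {n_slices} INT8 slices")

    # The mode itself is controlled with the documented environment variables, e.g.:
    #   HPL_EMULATE_DOUBLE_PRECISION=1
    #   HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT=53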

What struck me was that after a decade of enabling AI, HPC might once again be getting some love?

Let’s say the MI430X does land at 200-250 TFLOPS of FP64 in late 2026. Since the initial target is 4-way connectivity (“UL4”), then with a couple more node shrinks such a chip becomes a chiplet.

With 4 such chiplets on a card, they’d not be far from a petaFLOP of fp64. Of course, if AMD could do this, Nvidia could too. Hence that frisson of interest, and that question in the original post.
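
The arithmetic behind that speculation, using the hypothetical per-chiplet numbers above (not product specs):

    # Hypothetical figures from the speculation above, not product specifications.
    for per_chiplet_tflops in (200, 250):
        card_pflops = 4 * per_chiplet_tflops / 1000
        print(f"4 x {per_chiplet_tflops} TFLOPS = {card_pflops:.1f} PFLOPS FP64 per card")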

Moore’s Law is dead. Thus the implementation of Rubin “R100” as two reticle-size (ca. 850 mm², a bit more than one square inch) dies. I wouldn’t call a reticle-size chip a chiplet, but that’s just me. The cost per die and the power per die will be very high.

For-profit companies tend to give love to market segment(s) that make loads of money, either currently or in a longer-term growth scenario. Sure, the current AI bubble will deflate eventually, but I don’t see classical HPC with FP64 driving the sales volume needed to compensate. Could there be a halo product for the HPC market? Possibly.
