4090 doesn't have fp8 compute?

As an ML inference pipeline dev, I decided to try out the new NVIDIA TransformerEngine Python API, which comes with an exciting fp8_autocast context manager. I ran it on my new 4090 and got this:

AssertionError: Device compute capability 9.x required for FP8 execution.
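For context, that assertion corresponds to a compute-capability gate the library applies before enabling FP8. A minimal sketch of that gate (the function name is mine; only the 9.x threshold and the message come from the error above):

```python
def check_fp8_support(major: int, minor: int) -> None:
    """Raise if the device's compute capability is below what the
    library demands for FP8 execution (9.x at the time of this thread)."""
    if major < 9:
        raise AssertionError(
            "Device compute capability 9.x required for FP8 execution."
        )

# H100 (GH100) reports compute capability 9.0 and passes:
check_fp8_support(9, 0)

# RTX 4090 (AD102) reports 8.9, so this raises AssertionError:
# check_fp8_support(8, 9)
```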

So, despite the 4th-generation Tensor Cores on the GPU, it doesn't support FP8?

I don’t have practical experience with the latest hardware. From what I can glean from published information, the RTX 4090 (an Ada Lovelace class GPU, AD102) has compute capability 8.9. NVIDIA’s press releases mention FP8 when describing Hopper class GPUs, specifically the GH100, which has compute capability 9.0.
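To make the comparison concrete: "8.9" sounds close to "9.0", but compute capability versions compare as (major, minor) pairs, and the check in the error above is on the major version alone. A small sketch with the publicly listed values:

```python
# Compute capabilities quoted above, as (major, minor) tuples:
ada_rtx_4090 = (8, 9)   # AD102, Ada Lovelace
hopper_h100 = (9, 0)    # GH100, Hopper

# Version ordering is lexicographic on (major, minor):
assert ada_rtx_4090 < hopper_h100

# The "compute capability 9.x required" gate checks the major version:
assert ada_rtx_4090[0] < 9    # RTX 4090 fails the gate
assert hopper_h100[0] >= 9    # H100 passes
```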

Based on that, the error message you encountered seems to be accurate.

[Later:]

This announcement from NVIDIA does mention FP8 support for the RTX 4090:

Ada’s new 4th Generation Tensor Cores are unbelievably fast, with an all new 8-Bit Floating Point (FP8) Tensor Engine, increasing throughput by up to 5X, to 1.32 Tensor-petaFLOPS on the GeForce RTX 4090.


That does make sense to me. I expected it not to have the ‘Transformer Engine’, but I didn’t expect FP8 itself to be unsupported. I searched for a while for the true specs of the 4090, but could only find the simplified gamer-oriented specs.

It would have been nice to have the lack of FP8 compute documented anywhere, so that people interested in machine learning know what they are buying before committing to an expensive purchase. Or, if it is documented somewhere rather than just indirectly inferred, then I guess I made an incorrect assumption.

@user14984 In the meantime, I found an official NVIDIA announcement from September that FP8 is supported on RTX 4090. See update in my previous post.

Also, the Ada Lovelace architecture description mentions:

Ada’s new fourth-generation Tensor Cores are unbelievably fast, increasing throughput by up to 5X, to 1.4 Tensor-petaFLOPS using the new FP8 Transformer Engine, first introduced in our Hopper H100 datacenter GPU.


Ah! Okay, I guess the library itself has that limitation? I guess I’ll just wait to find out. Thanks!

The AD102 whitepaper here mentions:

“The GeForce RTX 4090 offers double the throughput for existing FP16, BF16, TF32, and INT8 formats, and its Fourth-Generation Tensor Core introduces support for a new FP8 tensor format. Compared to FP16, FP8 halves the data storage requirements and doubles throughput. With the new FP8 format, the GeForce RTX 4090 delivers 1.3 PetaFLOPS of performance for AI inference workloads.”

on page 27.
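The quoted figures are mutually consistent: FP8 halves per-element storage (1 byte vs. 2 for FP16) and doubles throughput, so the ~1.3 PFLOPS FP8 number implies roughly 0.66 PFLOPS at FP16. A sketch of that arithmetic, using only the marketing numbers quoted above (which include sparsity):

```python
# Figure quoted in the announcements above (sparsity included):
fp8_pflops_rtx_4090 = 1.32          # "1.32 Tensor-petaFLOPS"

# Per the whitepaper quote, FP8 doubles throughput vs. FP16
# and halves per-element storage:
implied_fp16_pflops = fp8_pflops_rtx_4090 / 2
fp16_bytes, fp8_bytes = 2, 1

print(f"Implied FP16 throughput: {implied_fp16_pflops:.2f} PFLOPS")
print(f"Storage ratio FP8/FP16: {fp8_bytes / fp16_bytes}")
```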


The documentation states this:

Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs,

(emphasis added)

RTX 40 series GPUs are not Hopper GPUs.

This is also consistent with the error message:

Device compute capability 9.x required

Just to be clear, I’m not saying anything about FP8 on Ada Lovelace GPUs, or RTX 40 series GPUs.

Since NVIDIA employees cannot comment on future products, and given that we have established that RTX 4090 hardware supports FP8 per NVIDIA’s public statements, this leaves us with two possibilities:

(1) FP8 support in the library will be added for Ada Lovelace class GPUs at a later time
(2) FP8 support in the library will not be added for Ada Lovelace GPUs for reasons of market segmentation

In practical terms, people interested in FP8 support in this library for Ada Lovelace class GPUs may want to consider filing an enhancement request with NVIDIA to increase the likelihood of (1).

As far as I understand, this library is meant for providing easy access to this new Hopper feature. Apart from compatibility of code using it, would it make sense at all to port it to earlier GPUs?

  1. Such a compatibility/fallback layer can be built by the users into the program using the library.
  2. Currently only the relatively expensive H100 has 9.0, so whoever uses the library specifically targets that hardware.
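The user-built fallback layer mentioned in point 1 could look something like the sketch below. Everything here is hypothetical: the wrapper and the fake context manager are mine, and the fake merely imitates the assertion the library raises on unsupported devices (in real code you would pass something like te.fp8_autocast instead).

```python
from contextlib import contextmanager

def run_with_fp8_fallback(forward, fp8_autocast):
    """Try to run `forward` under the FP8 autocast context; if the
    library refuses (e.g. on a pre-9.x device), run without it."""
    try:
        with fp8_autocast():
            return forward(), "fp8"
    except AssertionError:
        # e.g. "Device compute capability 9.x required for FP8 execution."
        return forward(), "no-fp8"

# Stand-in for fp8_autocast on an unsupported device (hypothetical):
@contextmanager
def fake_fp8_autocast():
    raise AssertionError(
        "Device compute capability 9.x required for FP8 execution."
    )
    yield  # unreachable; keeps this a generator for @contextmanager

result, mode = run_with_fp8_fallback(lambda: 2 + 2, fake_fp8_autocast)
print(result, mode)  # 4 no-fp8
```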

The Transformer Engine documentation includes equivalent PyTorch code for comparison.

If by “this new Hopper feature” you mean FP8 support, then the multiple references from official NVIDIA documents earlier in the thread make it clear that FP8 is also supported by the RTX 4090, which is not quite as pricey as an H100.

There might be technical reasons why TransformerEngine cannot be supported on RTX 4090 (although it is not apparent what they might be). There may also be marketing reasons not to support it on the RTX 4090. Only NVIDIA knows which is the case, and per long-standing policy, they would not engage in public discussions about such subject matter.

I meant the Transformer Engine with its “Adaptive Range Tracking” to dynamically choose between formats.

But you are right, even in the posts before it was mentioned that Lovelace would get the new 4th gen Tensor Cores with not only FP8 formats, but also the Transformer Engine.

So I will give another reason: up to now, neither FP8 nor the Transformer Engine is documented in the CUDA Programming Guide or the PTX ISA, not even for Hopper. We should wait for Toolkit 12, with full support for Hopper/Lovelace, before we can estimate how officially those features will be supported by the documentation, libraries, and tools. In a GTC talk it was announced that the next minor Toolkit release would not yet support all Lovelace/Hopper features, but that Toolkit 12 would. Still, I deem it a good sign that Lovelace is stated to have the Transformer Engine.

Turing (7.5), with 2nd-gen Tensor Cores, has working BF16 and TF32 tensor capabilities in hardware (TF32 is twice as fast on Turing desktop SMs as on Ampere desktop SMs), but only some tools support it (nvdisasm yes, ptxas no) and it is undocumented. The usual neural-network libraries have not added support for it. But then, it also was never advertised.

In the compute space, NVIDIA has a long history of software development being sub-optimally synchronized with hardware development. At the same time, NVIDIA also has a long history of being responsive to customer requests when growing and improving the CUDA ecosystem.

That is why I suggested that people who have an interest that specific hardware features be exposed in various elements of the software stack should file enhancement requests (RFEs).


Has anyone given it a try with V-Ray, FStorm, etc.? I really wonder how it works with render engines.
