As an ML inference pipeline dev, I decided to try out the new NVIDIA TransformerEngine Python API. It comes with an fp8_autocast context manager, which was exciting. I tried it out on my new RTX 4090 and got this:
AssertionError: Device compute capability 9.x required for FP8 execution.
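For reference, here is a minimal sketch of the kind of code that triggers this, modeled on the TE quickstart examples (layer sizes and shapes are placeholders):

```python
import torch
import transformer_engine.pytorch as te

# A single TE layer; fp8_autocast requests FP8 execution inside the block.
model = te.Linear(768, 768).cuda()
inp = torch.randn(32, 768, device="cuda")

with te.fp8_autocast(enabled=True):
    out = model(inp)  # raises the AssertionError above on devices below CC 9.x
```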
So, despite the 4th-generation Tensor Cores on the GPU, it doesn't support FP8?
I don’t have practical experience with the latest hardware. From what I can glean from published information, the RTX 4090 (an Ada Lovelace class GPU, AD102) has compute capability 8.9. NVIDIA’s press releases mention FP8 when describing Hopper class GPUs, specifically the GH100, which has compute capability 9.0.
Based on that, the error message you encountered seems to be accurate.
[Later:]
This announcement from NVIDIA does mention FP8 support for the RTX 4090:
Ada’s new 4th Generation Tensor Cores are unbelievably fast, with an all new 8-Bit Floating Point (FP8) Tensor Engine, increasing throughput by up to 5X, to 1.32 Tensor-petaFLOPS on the GeForce RTX 4090.
That does make sense to me. I expected it not to have the ‘transformer engine’, but I didn’t expect that FP8 wouldn’t be supported. I searched for a while for the true specs of the 4090, but could only find the simplified gamer-oriented specs.
It would have been nice to have the lack of FP8 compute described somewhere, so that people interested in machine learning know what they are buying before committing to an expensive purchase. Or, if it is described somewhere and not just indirectly inferable, then I guess I made an incorrect assumption.
Ada’s new fourth-generation Tensor Cores are unbelievably fast, increasing throughput by up to 5X, to 1.4 Tensor-petaFLOPS using the new FP8 Transformer Engine, first introduced in our Hopper H100 datacenter GPU.
“The GeForce RTX 4090 offers double the throughput for existing FP16, BF16, TF32, and INT8 formats, and its Fourth-Generation Tensor Core introduces support for a new FP8 tensor format. Compared to FP16, FP8 halves the data storage requirements and doubles throughput. With the new FP8 format, the GeForce RTX 4090 delivers 1.3 PetaFLOPS of performance for AI inference workloads.”
Transformer Engine (TE) is a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper GPUs, …
(emphasis added)
RTX 40 series GPUs are not Hopper GPUs.
This is also consistent with the error message:
Device compute capability 9.x required
Just to be clear, I’m not saying anything about FP8 hardware support on Ada Lovelace GPUs or RTX 40 series GPUs; only about what this library currently requires.
NVIDIA employees cannot comment on future products, and we have established that the RTX 4090 hardware supports FP8 per NVIDIA’s public statements. That leaves us with two possibilities:
(1) FP8 support in the library will be added for Ada Lovelace class GPUs at a later time
(2) FP8 support in the library will not be added for Ada Lovelace GPUs for reasons of market segmentation
In practical terms, people interested in FP8 support in this library for Ada Lovelace class GPUs may want to consider filing an enhancement request with NVIDIA to increase the likelihood of (1).
As far as I understand, this library is meant to provide easy access to this new Hopper feature. Apart from code compatibility, would it make sense at all to port it to earlier GPUs?
Such a compatibility/fallback layer could be built by users into the programs that use the library.
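For instance, a user-side fallback might gate on the reported compute capability and drop back to ordinary BF16 autocast on unsupported devices. A minimal sketch (the capability threshold mirrors the assertion above and is an assumption about the library's requirement):

```python
import torch
import transformer_engine.pytorch as te

def fp8_available() -> bool:
    # Mirrors the library's check: FP8 execution currently requires
    # compute capability 9.x (assumption based on the assertion above).
    major, _ = torch.cuda.get_device_capability()
    return major >= 9

def run(model, inp):
    if fp8_available():
        with te.fp8_autocast(enabled=True):
            return model(inp)
    # Fallback path for older GPUs: plain BF16 mixed precision.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(inp)
```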
Currently only the relatively expensive H100 has compute capability 9.0, so whoever uses the library specifically targets that hardware.
The Transformer Engine documentation includes equivalent PyTorch code for comparison.
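For example, the docs present TE modules as near drop-in replacements for their PyTorch counterparts, roughly along these lines (sizes are placeholders):

```python
import torch
import transformer_engine.pytorch as te

# Same construction and call signature; the TE version can run in FP8
# under fp8_autocast, while the plain PyTorch version cannot.
torch_layer = torch.nn.Linear(768, 768).cuda()
te_layer = te.Linear(768, 768).cuda()
```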
If by “this new Hopper feature” you mean FP8 support, then the multiple references from official NVIDIA documents earlier in the thread make it clear that FP8 is also supported by the RTX 4090, which is not quite as pricey as an H100.
There might be technical reasons why TransformerEngine cannot be supported on RTX 4090 (although it is not apparent what they might be). There may also be marketing reasons not to support it on the RTX 4090. Only NVIDIA knows which is the case, and per long-standing policy, they would not engage in public discussions about such subject matter.
I meant the Transformer Engine with its “Adaptive Range Tracking” to dynamically choose between formats.
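For what it’s worth, in the TE Python API that range tracking surfaces as the “delayed scaling” recipe, which keeps a running history of per-tensor maxima and can mix the two FP8 formats. A sketch (parameter values are placeholders):

```python
from transformer_engine.common import recipe

# HYBRID uses E4M3 (more mantissa bits) for the forward pass and E5M2
# (more exponent range) for gradients; scale factors are derived from a
# running window of per-tensor absolute maxima ("amax").
fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)
```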
But you are right: even in the earlier posts it was mentioned that Lovelace would get the new 4th-gen Tensor Cores with not only the FP8 formats but also the Transformer Engine.
So I will give another reason: up to now, neither FP8 nor the Transformer Engine is documented in the CUDA Programming Guide or the PTX ISA, even for Hopper. We should wait until Toolkit 12, with full support for Hopper/Lovelace, before we can estimate how officially those features will be supported by the documentation, libraries, and tools. In a GTC talk it was announced that the next minor Toolkit release would not support all Lovelace/Hopper features yet, but that Toolkit 12 would. Still, I deem it a good sign that Lovelace is stated to have the Transformer Engine.
Turing (7.5), with 2nd-gen Tensor Cores, has working BF16 and TF32 tensor capabilities in hardware; TF32 is twice as fast on Turing desktop SMs as on Ampere desktop SMs. But only some tools support it (nvdisasm yes, ptxas no), it is undocumented, and the usual neural network libraries have not added support for it. Then again, it also was never advertised.
In the compute space, NVIDIA has a long history of software development being sub-optimally synchronized with hardware development. At the same time, NVIDIA also has a long history of being responsive to customer requests when growing and improving the CUDA ecosystem.
That is why I suggested that people who have an interest that specific hardware features be exposed in various elements of the software stack should file enhancement requests (RFEs).
I’m also encountering the same problem here. If I comment out the assertion for the compute-arch check, the marketing material says the card has FP8, but transformer_engine then fails in the cuBLAS call:

RuntimeError: /home/victoryang00/Documents/TransformerEngine/transformer_engine/common/include/transformer_engine/logging.h:38 in function check_cublas_: CUBLAS Error: an unsupported value or parameter was passed to the function
I’m using danielpoochai/FloatSD (github.com) with smooth calculation; it seems that cuBLAS rules out compute arch 8.9 for the FP8 API. Hopefully NVIDIA can make it right!
Not all Tensor Cores are 4th-gen Tensor Cores that support FP8; maybe cuBLAS needs changes to its SM scheduler to schedule FP8 operations differently than on the H100?
So I pay big bucks for a 4090, which has FP8 hardware, but I can’t use it because customers matter less than market segmentation. Am I reading this correctly? How much extra did this 4th-gen hardware I can’t use cost me?
As far as I know (I don’t have a 4090), this sample code will run on an RTX 4090, and it demonstrates that FP8 matmul (i.e., Tensor Core) support is available to 4090 users. This thread may also be of historical interest.
Have you tried to use it with the current CUDA version, 12.4? If so, what did you observe? In my post from 2022 I offered two possible explanations for FP8 support not being provided in usable fashion in software on the RTX 4090 at that time. You appear to have jumped on one of them.
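For anyone checking their own setup, a quick probe of what the environment reports (assuming a PyTorch install):

```python
import torch

print(torch.version.cuda)                  # CUDA version PyTorch was built against
print(torch.cuda.get_device_name())        # e.g. "NVIDIA GeForce RTX 4090"
print(torch.cuda.get_device_capability())  # e.g. (8, 9) on an RTX 4090
```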
Nothing, actually. I don’t have an RTX 4090, so I cannot tell you what is or isn’t currently supported on it in terms of FP8, but the price you pay for an RTX 4090 has nothing to do with how much it cost to make the chip(s). NVIDIA presumably would not sell a GPU below manufacturing cost, but generally speaking, even that is not outside the realm of possibility.
Hi, this still does not seem to work. Do you know whom we can ask about this?
This is the error; it is still not supported on Ada:
“transformer_engine.pytorch.DotProductAttention. It seems that only device sm_arch >=90 can select FP8 sub-backend, while the sm_arch of ada device is 89. Is really FP8 computation now supported on Ada?” (github.com/NVIDIA/TransformerEngine/issues/15)
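If it helps to narrow the report down: the FP8 GEMM path and the FP8 attention sub-backend appear to be gated separately, so a minimal repro for the issue above would exercise DotProductAttention specifically. A sketch of what that might look like (shapes are placeholders, and fp8_dpa is my assumption about how FP8 attention is requested in current TE recipes):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Placeholder sizes: seq=128, batch=2, heads=16, head_dim=64.
dpa = te.DotProductAttention(num_attention_heads=16, kv_channels=64)
q = torch.randn(128, 2, 16, 64, device="cuda", dtype=torch.bfloat16)
k, v = q.clone(), q.clone()

# fp8_dpa=True asks for the FP8 attention sub-backend (assumption);
# on sm_89 (Ada) this is reportedly unavailable per the issue above.
fp8_recipe = recipe.DelayedScaling(fp8_dpa=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = dpa(q, k, v)
```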