As an ML inference pipeline dev, I decided to try out the new NVIDIA TransformerEngine Python API, which ships with an fp8_autocast context manager, which was exciting. I tried it on my new 4090 and got this:
AssertionError: Device compute capability 9.x required for FP8 execution.
So, despite the 4th-generation Tensor Cores on the GPU, it doesn’t support FP8?
I don’t have practical experience with the latest hardware. From what I can glean from published information, the RTX 4090 (an Ada Lovelace-class GPU, AD102) has compute capability 8.9. NVIDIA’s press releases mention FP8 when describing Hopper-class GPUs, specifically the GH100, which has compute capability 9.0.
Based on that, the error message you encountered seems to be accurate.
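The check can be reproduced without the library. Here is a minimal sketch of the gate the AssertionError appears to enforce (the function name is my own illustration, not TransformerEngine’s actual code):

```python
# Hypothetical sketch of the compute-capability gate behind the
# AssertionError above; the function name is illustrative only.

def fp8_supported(compute_capability):
    """Return True if the (major, minor) pair meets the 9.x requirement."""
    major, _minor = compute_capability
    return major >= 9

print(fp8_supported((8, 9)))  # RTX 4090 (Ada, AD102)
print(fp8_supported((9, 0)))  # H100 (Hopper, GH100)
```

On a live system, `torch.cuda.get_device_capability()` returns the (major, minor) pair you would feed into such a check.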
This announcement from NVIDIA does mention FP8 support for the RTX 4090:
Ada’s new 4th Generation Tensor Cores are unbelievably fast, with an all new 8-Bit Floating Point (FP8) Tensor Engine, increasing throughput by up to 5X, to 1.32 Tensor-petaFLOPS on the GeForce RTX 4090.
That does make sense to me. I expected it not to have the ‘transformer engine’, but I didn’t expect FP8 itself to be unsupported. I searched for a while for the true specs of the 4090, but could only find the simplified gamer-oriented specs.
It would have been nice for the lack of FP8 compute to be documented anywhere, so that people interested in machine learning know what they are buying before committing to an expensive purchase. Or, if it is documented somewhere rather than just indirectly inferable, then I guess I made an incorrect assumption.
Ada’s new fourth-generation Tensor Cores are unbelievably fast, increasing throughput by up to 5X, to 1.4 Tensor-petaFLOPS using the new FP8 Transformer Engine, first introduced in our Hopper H100 datacenter GPU.
“The GeForce RTX 4090 offers double the throughput for existing FP16, BF16, TF32, and INT8 formats, and its Fourth-Generation Tensor Core introduces support for a new FP8 tensor format. Compared to FP16, FP8 halves the data storage requirements and doubles throughput. With the new FP8 format, the GeForce RTX 4090 delivers 1.3 PetaFLOPS of performance for AI inference workloads.”
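The storage claim in that quote is simple arithmetic, since FP8 uses one byte per element versus two for FP16. A quick sketch, using a made-up model size for illustration:

```python
# Quick arithmetic behind "FP8 halves the data storage requirements":
# bytes per element for each format. The parameter count is a
# hypothetical example, not a real model.

params = 7_000_000_000          # hypothetical 7B-parameter model
fp16_bytes = params * 2         # FP16: 2 bytes per element
fp8_bytes = params * 1          # FP8: 1 byte per element
print(fp16_bytes / 1e9, "GB (FP16) vs", fp8_bytes / 1e9, "GB (FP8)")
```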
If by “this new Hopper feature” you mean FP8 support, then the multiple references from official NVIDIA documents earlier in the thread make it clear that FP8 is also supported by the RTX 4090, which is not quite as pricey as an H100.
There might be technical reasons why TransformerEngine cannot be supported on RTX 4090 (although it is not apparent what they might be). There may also be marketing reasons not to support it on the RTX 4090. Only NVIDIA knows which is the case, and per long-standing policy, they would not engage in public discussions about such subject matter.
I meant the Transformer Engine with its “Adaptive Range Tracking”, which dynamically chooses between formats.
But you are right: earlier posts already mentioned that Lovelace would get the new 4th-gen Tensor Cores, with not only the FP8 formats but also the Transformer Engine.
So I will give another reason: up to now, neither FP8 nor the Transformer Engine is documented in the CUDA Programming Guide or the PTX ISA, even for Hopper. We should wait until Toolkit 12, with full support for Hopper/Lovelace, before we can estimate how officially those features will be supported by the documentation, libraries, and tools. A GTC talk announced that the next minor Toolkit release would not yet support all Lovelace/Hopper features, but that Toolkit 12 would. Still, I deem it a good sign that Lovelace is stated to have the Transformer Engine.
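For intuition, the “dynamically choose between formats” idea can be sketched in plain Python. This is only an illustration of the concept, not NVIDIA’s implementation; the thresholds are the largest finite values representable in the two FP8 variants:

```python
# Illustrative sketch of range-based FP8 format selection, not
# NVIDIA's actual Transformer Engine logic.

E4M3_MAX = 448.0     # 4 exponent bits, 3 mantissa bits: more precision
E5M2_MAX = 57344.0   # 5 exponent bits, 2 mantissa bits: more range

def pick_format(amax):
    """Choose a format from the observed absolute maximum of a tensor."""
    if amax <= E4M3_MAX:
        return "E4M3"   # fits the narrower format, keep the precision
    if amax <= E5M2_MAX:
        return "E5M2"   # needs the wider dynamic range
    return "FP16"       # exceeds FP8 entirely, fall back

print(pick_format(100.0))
print(pick_format(1000.0))
print(pick_format(1e6))
```

In practice the Transformer Engine also applies per-tensor scaling factors rather than relying on raw ranges alone, but the range-tracking intuition is the same.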
Turing (7.5), with its 2nd-gen Tensor Cores, has working BF16 and TF32 tensor capabilities in hardware (TF32 is twice as fast on Turing desktop SMs as on Ampere desktop SMs), but only some tools support it (nvdisasm yes, ptxas no) and it is undocumented. The usual neural-network libraries have not added support for it. But it also was never advertised.
In the compute space, NVIDIA has a long history of software development being sub-optimally synchronized with hardware development. At the same time, NVIDIA also has a long history of being responsive to customer requests when growing and improving the CUDA ecosystem.
That is why I suggested that people with an interest in seeing specific hardware features exposed in various elements of the software stack should file enhancement requests (RFEs).
I’m also encountering the same problem here. If I comment out the assertion for the compute-arch check, the forward propagation appears to attempt FP8, but transformer_engine then fails when calling into cuBLAS:

RuntimeError: /home/victoryang00/Documents/TransformerEngine/transformer_engine/common/include/transformer_engine/logging.h:38 in function check_cublas_: CUBLAS Error: an unsupported value or parameter was passed to the function