Unable to run inference using TensorRT FP8 quantization
TensorRT Version: 8.6.1
GPU Type: RTX 4070 Ti
Nvidia Driver Version: 530
CUDA Version: 12.1
CUDNN Version: 184.108.40.206
Operating System + Version: Ubuntu 22.04 LTS
Python Version (if applicable): 3.10
TensorFlow Version (if applicable): —
PyTorch Version (if applicable): —
Baremetal or Container (if container which image + tag): baremetal
NVIDIA claims that the new 4th-generation Tensor Cores support FP8 quantization.
I have an RTX 4070 Ti (Ada Lovelace architecture), which has 4th-gen Tensor Cores and supports the FP8 format, so I installed CUDA 12.1 and TensorRT 8.6.1, which includes the kFP8 data type for FP8.
However, I cannot successfully run inference in FP8. INT8 works fine, but FP8 does not.
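For reference, this is roughly what I am trying (a minimal sketch; the model path is a placeholder, and I am assuming the FP8 builder flag is enabled the same way as the INT8 one):

```python
# Sketch of the engine build I'm attempting. "model.onnx" is a
# placeholder; the FP8 flag usage mirrors how INT8 is enabled.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder model file
    parser.parse(f.read())

config = builder.create_builder_config()
# With trt.BuilderFlag.INT8 here the build succeeds; with FP8 it does not.
config.set_flag(trt.BuilderFlag.FP8)
engine_bytes = builder.build_serialized_network(network, config)
```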
Then, in the limitations section of the documentation, I found that FP8 is not supported in TensorRT yet.
So FP8 is implemented in TensorRT 8.6.1 but not supported yet?
Could you please explain what that means? To me, "not supported" implies it is not implemented either.
- How can I quantize a model in FP8 and run inference in FP8 using TensorRT?
- Which tool should I use instead of TensorRT to quantize, calibrate, and run inference in FP8 format?
- When will TensorRT support the FP8 quantization format?
- Do you have any Early Access software for quantizing and running models in FP8?
Thank you in advance.