Unable to quantize to FP8 in TensorRT


Unable to run inference using TensorRT FP8 quantization


TensorRT Version: 8.6.1
GPU Type: RTX 4070 Ti
Nvidia Driver Version: 530
CUDA Version: 12.1
CUDNN Version:
Operating System + Version: Ubuntu 22.04 LTS
Python Version (if applicable): 3.10
TensorFlow Version (if applicable): —
PyTorch Version (if applicable): —
Baremetal or Container (if container which image + tag): baremetal

My Problem

NVIDIA claims that the new 4th-generation Tensor Cores support FP8 quantization.
I have an RTX 4070 Ti (Ada Lovelace architecture), which has 4th-gen Tensor Cores and supports the FP8 format, so I installed CUDA 12.1 and TensorRT 8.6.1, which includes the kFP8 data type for FP8.
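For reference, my understanding is that FP8 on Ada Tensor Cores usually means the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, max finite magnitude 448). A rough pure-Python sketch of what rounding a value onto that grid looks like (illustration only, independent of TensorRT; NaN and rounding-tie details are ignored):

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (1 sign, 4 exponent,
    3 mantissa bits, exponent bias 7, max finite magnitude 448).
    Illustration only -- ignores NaN and rounding-tie subtleties."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    mag = min(abs(x), 448.0)                   # saturate at the E4M3 max finite value
    exp = max(math.floor(math.log2(mag)), -6)  # -6 starts the subnormal range
    step = 2.0 ** (exp - 3)                    # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step

print(quantize_e4m3(0.3))     # 0.3125 -- only 8 representable steps between 0.25 and 0.5
print(quantize_e4m3(1000.0))  # 448.0  -- values above the max saturate
```

This shows why FP8 still needs calibration, much like INT8: the dynamic range is tiny, so a per-tensor scale has to map activations into roughly [-448, 448] before rounding.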

However, I cannot successfully run inference in FP8. INT8 works fine, but FP8 does not.

I then found in the limitations section that FP8 is not supported in TensorRT yet.

So FP8 is implemented in TensorRT 8.6.1 but not supported yet?
Could you please explain what that means? To me, "not supported" implies it is not implemented either.

My Questions

  1. How can I quantize a model to FP8 and run inference in FP8 using TensorRT?
  2. Which tool should I use instead of TensorRT to quantize, calibrate, and run inference in the FP8 format?
  3. When will TensorRT support FP8 quantization?
  4. Do you have any Early Access software for quantizing to and using the FP8 format?

Thank you in advance.

Best regards


TensorRT 8.6 adds nvinfer1::DataType::kFP8 to the public API in preparation for FP8 support in future TensorRT releases. However, FP8 (8-bit floating point) is not currently supported by TensorRT, and attempting to use FP8 will result in an error or undefined behavior.
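One way to see this distinction for yourself (a hedged sketch; it assumes the TensorRT Python bindings are installed as `tensorrt` and that the 8.6 bindings expose `DataType.FP8` and `BuilderFlag.FP8`): the enums can be present in the API even though the runtime rejects FP8 engines, so finding the enum must not be read as FP8 being usable.

```python
def fp8_enums_present() -> bool:
    """Return True if the installed TensorRT Python bindings expose the
    FP8 enums added in 8.6. Note: the enum being present does NOT mean
    FP8 kernels are supported -- 8.6 only reserves the API surface."""
    try:
        import tensorrt as trt  # requires a TensorRT installation
    except ImportError:
        return False  # bindings not installed on this machine
    return hasattr(trt.DataType, "FP8") and hasattr(trt.BuilderFlag, "FP8")

print("FP8 enums in API:", fp8_enums_present())
```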

Please refer to the Developer Guide :: NVIDIA Deep Learning TensorRT Documentation for details.

Thank you.