Optimizing XLM-RoBERTa: Seeking Better Performance with INT8 Quantization Over FP16

Description

Hello,

I am currently optimizing an XLM-RoBERTa model (a BERT-like encoder). I have tried two different optimization approaches:

  1. The first approach was to insert fake-quantization nodes (INT8 SmoothQuant) using the ModelOpt (TensorRT Model Optimizer) tool, and then build an INT8-precision engine with TensorRT (a rough sketch of this flow follows the list).
  2. The second approach was to build an FP16-precision engine directly with TensorRT.
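
Roughly, the first flow looks like the sketch below. This is a simplified illustration rather than my exact script: the model name, calibration texts, output file names, and export settings are placeholders, and I am assuming ModelOpt's `INT8_SMOOTHQUANT_CFG` / `quantize` API as documented.

```python
# Sketch of approach 1: insert INT8 SmoothQuant fake-quant (Q/DQ) nodes with
# NVIDIA TensorRT Model Optimizer (ModelOpt), then export ONNX for TensorRT.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("xlm-roberta-base").eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Placeholder calibration set; in practice this should be representative data.
calib_texts = ["example sentence one", "example sentence two"]

def forward_loop(m):
    # Run a few representative batches so ModelOpt can collect activation ranges.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        m(**inputs)

# Apply the INT8 SmoothQuant configuration shipped with ModelOpt.
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

# Export with the fake-quant nodes baked in; TensorRT then builds an INT8 engine.
dummy = tokenizer("dummy", return_tensors="pt").to("cuda")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "xlmr_int8_qdq.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)
```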

After completing these optimizations, I found that the INT8-quantized engine ran at roughly the same speed as the FP16 engine, while its execution-context (activation) memory was about twice as large. I expected INT8 quantization to beat FP16 in both runtime and context size, but in practice the latency was no better and the memory usage was worse.
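
For reference, the serialized engine size and the per-context device memory can be compared roughly as below. This is only a sketch: the engine file names are placeholders, and `device_memory_size` is the attribute I am assuming is still available in TensorRT 10 (newer releases also expose a v2 variant).

```python
# Sketch: compare serialized engine size and per-context scratch memory
# for the FP16 and INT8 engines (file names are placeholders).
import os
import tensorrt as trt

def report(engine_path):
    logger = trt.Logger(trt.Logger.WARNING)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    print(engine_path)
    print("  serialized size (bytes):", os.path.getsize(engine_path))
    # device_memory_size = scratch/activation memory one execution context needs.
    print("  context device memory (bytes):", engine.device_memory_size)

report("xlmr_fp16.engine")
report("xlmr_int8_qdq.engine")
```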

I rendered the TensorRT engine layer graph for both builds using the trex (trt-engine-explorer) tool and noticed that the multi-head attention subgraph in the FP16 engine is much simpler than in the INT8 engine.
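
The per-layer JSON that trex consumes can be regenerated roughly as follows (a sketch; the engine file names are placeholders, and I am assuming the standard trtexec export flags):

```python
# Sketch: export detailed layer info and profiling JSON for trex from each engine.
import subprocess

for name in ("xlmr_fp16", "xlmr_int8_qdq"):
    subprocess.run(
        [
            "trtexec",
            f"--loadEngine={name}.engine",
            "--profilingVerbosity=detailed",
            f"--exportLayerInfo={name}.graph.json",
            f"--exportProfile={name}.profile.json",
        ],
        check=True,
    )
```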

I have the following two questions:

  1. Is there a convenient way to perform INT8 quantization that yields a simpler computation graph? I suspect this could outperform the FP16 model. (Is that assumption reasonable?)
  2. Are there any other methods to further optimize the model so that it outperforms the FP16 version?

Environment

TensorRT Version: 10.3
GPU Type: A100
Nvidia Driver Version: 525.105.17
CUDA Version: 12.0

Hi @786253643 ,
Checking on the details.