Description
Hello,
I am currently optimizing an XLM-RoBERTa model (a BERT-like model). I have used two different optimization approaches:
- The first approach was to insert fake quantization nodes (INT8 SmoothQuant) using the model_opt tool (roughly as sketched below), and then build the engine at INT8 precision with TensorRT.
- The second approach was to build the engine directly at FP16 precision with TensorRT.
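For reference, the INT8 path looked roughly like the sketch below. The checkpoint name, calibration texts, file names, and the trtexec commands in the trailing comments are placeholders, and I am going from memory of the `modelopt.torch.quantization` API (`mtq.quantize` with `INT8_SMOOTHQUANT_CFG`), so treat it as an approximation of my actual script rather than an exact copy:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import modelopt.torch.quantization as mtq

# Placeholder checkpoint name; the real model is an XLM-RoBERTa fine-tune.
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base").eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def forward_loop(m):
    # Calibration pass for SmoothQuant: run a few representative batches so
    # activation statistics can be collected (the texts are placeholders).
    for text in ["example sentence one", "example sentence two"]:
        inputs = tokenizer(text, return_tensors="pt").to("cuda")
        m(**inputs)

# Insert INT8 SmoothQuant fake-quantization (Q/DQ) nodes into the model.
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

# Export the fake-quantized model to ONNX so TensorRT can consume the Q/DQ graph.
dummy = tokenizer("example", return_tensors="pt").to("cuda")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "xlmr_int8_qdq.onnx",
    input_names=["input_ids", "attention_mask"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)

# Engine builds (the second approach is the plain FP16 build from the original ONNX):
#   trtexec --onnx=xlmr_int8_qdq.onnx --int8 --fp16 --saveEngine=xlmr_int8.engine
#   trtexec --onnx=xlmr_fp16.onnx --fp16 --saveEngine=xlmr_fp16.engine
```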
After completing these optimizations, I found that the INT8 quantized engine ran at about the same speed as the FP16 engine, but its context memory size was roughly twice as large. I expected INT8 quantization to beat the FP16 model on both runtime and memory, but surprisingly it was no faster and its memory usage was actually worse.
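For clarity, by "context size" I mean the device memory the execution context needs at runtime; I compared the two engines roughly as below (engine file names are placeholders, and I am assuming `ICudaEngine.device_memory_size` in the TensorRT 10 Python API is the right field to look at):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

for path in ["xlmr_fp16.engine", "xlmr_int8.engine"]:  # placeholder file names
    with open(path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    # device_memory_size is the activation/scratch memory an execution context
    # needs, which is what I am calling "context size" above.
    print(path, engine.device_memory_size / (1 << 20), "MiB")
```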
I expanded the TensorRT engine's layer graph for both builds using the trex (trt-engine-explorer) tool and noticed that the multi-head attention subgraph in the FP16 engine is much simpler than in the INT8 engine.
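In case it is useful, the per-layer breakdown can also be dumped directly from the engines with TensorRT's engine inspector and diffed as JSON alongside the trex visualization; a minimal sketch (file names are placeholders, and it assumes the engines were built with --profilingVerbosity=detailed so the layer information is complete):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

for path in ["xlmr_fp16.engine", "xlmr_int8.engine"]:  # placeholder file names
    with open(path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    inspector = engine.create_engine_inspector()
    # Dump layer names, types, precisions, and fusions for the whole engine.
    info = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
    with open(path + ".layers.json", "w") as out:
        out.write(info)
```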
I have the following two questions:
- Is there a convenient way to perform INT8 quantization that produces a simpler computation graph? I suspect this could achieve better performance than the FP16 model. (By the way, is that assumption correct and reasonable?) A rough sketch of what I have in mind follows after these questions.
- Are there any other methods to further optimize the model to achieve better performance than the FP16 model?
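For the first question, what I have in mind is something like the sketch below: keep the INT8 fake quantization on the linear/GEMM layers but disable the quantizers inside the self-attention blocks, so that TensorRT can hopefully still pick its fused FP16 attention path. The "*attention*" wildcard is my guess at the HuggingFace XLM-RoBERTa module names, and the config layout is an assumption about how ModelOpt's quant_cfg wildcards work, so please correct me if this is not the intended approach:

```python
import copy

import modelopt.torch.quantization as mtq

# Start from the stock SmoothQuant recipe and turn off every quantizer whose
# name matches the attention blocks ("*attention*" is a guess at the
# HuggingFace XLM-RoBERTa module names).
cfg = copy.deepcopy(mtq.INT8_SMOOTHQUANT_CFG)
cfg["quant_cfg"]["*attention*"] = {"enable": False}

# `model` and `forward_loop` are the same objects as in the first sketch above.
model = mtq.quantize(model, cfg, forward_loop)
```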
Environment
TensorRT Version: 10.3
GPU Type: A100
Nvidia Driver Version: 525.105.17
CUDA Version: 12.0