How to apply INT8 quantization to a Transformer on Xavier

When using trtexec to convert an ONNX model to a TensorRT engine with --int8, ops such as Einsum and MatMul fall back to FP32, whereas with --fp16 they run fine. Is there any other way to speed up Transformer inference?

Hi,

We recommend trying mixed precision for better performance, e.g. trtexec --best, which lets TensorRT choose among FP32, FP16, and INT8 per layer:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --best
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --best
[08/12/2022-10:31:36] [I] === Model Options ===
[08/12/2022-10:31:36] [I] Format: ONNX
[08/12/2022-10:31:36] [I] Model: /usr/src/tensorrt/data/resnet50/ResNet50.onnx
[08/12/2022-10:31:36] [I] Output:
[08/12/2022-10:31:36] [I] === Build Options ===
[08/12/2022-10:31:36] [I] Max batch: explicit batch
[08/12/2022-10:31:36] [I] Workspace: 16 MiB
[08/12/2022-10:31:36] [I] minTiming: 1
[08/12/2022-10:31:36] [I] avgTiming: 8
[08/12/2022-10:31:36] [I] Precision: FP32+FP16+INT8
[08/12/2022-10:31:36] [I] Calibration: Dynamic
[08/12/2022-10:31:36] [I] Refit: Disabled
[08/12/2022-10:31:36] [I] Sparsity: Disabled
[08/12/2022-10:31:36] [I] Safe mode: Disabled
[08/12/2022-10:31:36] [I] DirectIO mode: Disabled
[08/12/2022-10:31:36] [I] Restricted mode: Disabled
[08/12/2022-10:31:36] [I] Save engine:
[08/12/2022-10:31:36] [I] Load engine:
[08/12/2022-10:31:36] [I] Profiling verbosity: 0
[08/12/2022-10:31:36] [I] Tactic sources: Using default tactic sources
[08/12/2022-10:31:36] [I] timingCacheMode: local
...
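
If you prefer to build the engine programmatically rather than through trtexec, below is a minimal sketch using the TensorRT 8.x Python API that enables the same FP16+INT8 mixed precision. The file names model.onnx and model_mixed.engine, as well as the calibrator, are placeholders and not from this thread; for meaningful INT8 accuracy you would attach your own IInt8Calibrator.

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a hypothetical path; use your own Transformer ONNX file.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 tactics
config.set_flag(trt.BuilderFlag.INT8)   # allow INT8 tactics
# For accurate INT8, supply calibration data, e.g.:
# config.int8_calibrator = my_calibrator  # user-provided IInt8Calibrator

serialized_engine = builder.build_serialized_network(network, config)
with open("model_mixed.engine", "wb") as f:  # hypothetical output name
    f.write(serialized_engine)

As with --best, TensorRT remains free to keep layers such as Einsum or MatMul in FP16 or FP32 when no faster INT8 tactic is available.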

Thanks.
