After reading the topic "Explicit vs Implicit Quantization", I think explicit quantization is better than implicit quantization. But I also found a statement in the TRT 8 docs that says "DLA does not support Explicit Quantization". Does that mean int8 inference acceleration with DLA is only possible with implicit quantization?
Hi,
In implicit precision mode, only a single dynamic range can be set on an ITensor. Does that mean TRT can't use DLA with per-channel quantization? And can I simulate per-channel quantization with an IScaleLayer?
If using calibration, TensorRT only supports per-tensor quantization (PTQ): a single scale for each activation tensor, with per-channel scales for weights. For operations such as conv, deconv, and fc, TRT computes per-channel kernel scales from the single input-activation scale, the per-channel weight scales, and the single output-activation scale.
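To make that concrete, here is a minimal sketch (plain Python, not the TensorRT API; the function name and values are illustrative) of how a single input scale, per-channel weight scales, and a single output scale combine into one requantization scale per output channel:

```python
def per_channel_kernel_scales(input_scale, weight_scales, output_scale):
    # One combined (requantization) scale per output channel:
    # int32 accumulator * (s_in * s_w[c]) must be rescaled to the
    # single output-activation scale s_out.
    return [input_scale * w / output_scale for w in weight_scales]

# Each output channel ends up with its own effective scale,
# even though activations use a single scale on each side.
print(per_channel_kernel_scales(0.5, [0.25, 0.5, 1.0], 0.125))
```

The point is that per-channel weight scales are folded into the kernel, so the activation tensors themselves still carry only one scale each, matching the single-dynamic-range restriction of implicit mode.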
If using QDQ ops, TRT does support both PTQ and PCQ (per-channel quantization).
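For intuition on what per-channel Q/DQ ops express, here is a rough sketch (plain Python, illustrative names, not the TensorRT API) of a per-channel fake-quantize round trip, where each channel (row) is quantized with its own scale, clamped to the int8 range, and dequantized:

```python
def fake_quant_per_channel(x_rows, scales, qmin=-128, qmax=127):
    # Quantize each channel (row) with its own scale, clamp to the
    # int8 range, then dequantize back to float.
    out = []
    for row, s in zip(x_rows, scales):
        q = [min(max(round(v / s), qmin), qmax) for v in row]
        out.append([qi * s for qi in q])
    return out

# Second channel has a larger scale, so 300.0 saturates at 127 * 2.0.
print(fake_quant_per_channel([[0.5, -1.0], [10.0, 300.0]], [0.5, 2.0]))
```

With QDQ ops these per-channel scales are carried explicitly in the graph, which is why explicit quantization can express PCQ while calibration-based implicit mode cannot.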