Hi,
I’m looking for an explanation of how int8 TensorRT ops with multiple inputs are implemented, for example element-wise addition. In particular, I’m wondering how things work when the two inputs have very different quantization scales. One implementation I can imagine is to load each int8 input tensor, dequantize it to a higher-precision format using its own quantization scale, do the addition in that higher-precision format, and then quantize the result back to int8 using the output’s quantization scale. Is that how it works? Are there any good references on these kinds of internals?
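For concreteness, here is roughly the scheme I have in mind as a NumPy sketch (scales are made up, and I’m not claiming this is what TensorRT actually does under the hood):

```python
import numpy as np

def int8_elementwise_add(a_int8, b_int8, scale_a, scale_b, scale_out):
    # Dequantize each input with its own scale into a higher-precision format.
    a_fp = a_int8.astype(np.float32) * scale_a
    b_fp = b_int8.astype(np.float32) * scale_b
    # Do the addition in higher precision.
    out_fp = a_fp + b_fp
    # Requantize with the output scale, rounding and saturating to int8.
    return np.clip(np.rint(out_fp / scale_out), -128, 127).astype(np.int8)

# Example with two very different input scales.
a = np.array([100, -50], dtype=np.int8)
b = np.array([100, -50], dtype=np.int8)
print(int8_elementwise_add(a, b, scale_a=0.5, scale_b=0.001, scale_out=0.5))
```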
Cheers,
Eric
Hi @eric.crawford ,
Yes, each input can have its own scale.
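For example, with implicit quantization each input tensor of an element-wise layer carries its own dynamic range (and hence its own scale). A rough sketch with the Python API; tensor shapes, names, and range values below are placeholders, not taken from a real network:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

x = network.add_input("x", trt.float32, (1, 8, 16, 16))
y = network.add_input("y", trt.float32, (1, 8, 16, 16))
add = network.add_elementwise(x, y, trt.ElementWiseOperation.SUM)
network.mark_output(add.get_output(0))

# Each tensor has its own dynamic range, from which its int8 scale is derived.
x.dynamic_range = (-4.0, 4.0)      # scale ~ 4.0 / 127
y.dynamic_range = (-0.02, 0.02)    # scale ~ 0.02 / 127
add.get_output(0).dynamic_range = (-4.0, 4.0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
```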
I am checking whether there are any references I can share.
Thanks
Thanks for the reply. I am also interested in the same question for the DLA, i.e. details on the implementation of element-wise ops, and any complications that might arise when the two inputs have very different scales. I am currently trying to deploy a network on an Orin DLA at int8 precision, but the output of the network is nonsense (even though the same network works at fp16 on the DLA, and works at int8 on the Orin GPU). I’m starting to wonder if it might be something to do with these element-wise ops, since unlike the GPU, the DLA does not appear to fuse element-wise ops with the preceding convolutions.
If it’s helpful, I’m using JetPack 5.1.3 (TensorRT 8.5.0.2, CUDA 11.4, cuDNN 8.6, etc).
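As a sanity check, I’m thinking of forcing just the element-wise layers to fp16 in the otherwise-int8 DLA build, to see whether the output recovers. Roughly this (a sketch; engine/network setup omitted, and I’m selecting layers purely by type):

```python
import tensorrt as trt

def force_elementwise_fp16(network, config):
    # Otherwise-int8 DLA build, with fp16 available as a fallback precision.
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0

    # Pin every element-wise layer (and its outputs) to fp16.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type == trt.LayerType.ELEMENTWISE:
            layer.precision = trt.DataType.HALF
            for j in range(layer.num_outputs):
                layer.set_output_type(j, trt.DataType.HALF)
```

Does that seem like a reasonable way to narrow down whether the element-wise ops are the problem?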