We trained the object detection model with the TensorFlow/Keras framework in FP32 precision and then performed PTQ on a calibration dataset.
We got a TRT engine with good inference speed, but accuracy dropped significantly, so we decided to perform QAT training.
After a lot of refactoring we obtained a final INT8 model with accuracy comparable to the FP32 model (sometimes even better), but the TRT engine produced from it is considerably slower than the engine generated with PTQ. Using TRT profiling we found that many operations are still executed in FP32, so we investigated all of those cases and, wherever possible, added Q/DQ nodes or changed the graph to avoid transitions to FP32.
We found that the entire speed difference between the QAT and PTQ engines comes from the UpSampling2D layers (translated to the ONNX Resize operation), which upscale the input tensor in "nearest" mode. In the PTQ engine these layers are executed in INT8 precision, but in the QAT engine their input tensors are always dequantized to FP32 and then quantized back to INT8. From the profiling log we see that the total cost of the extra DQ/Q nodes kept in the graph roughly matches the performance loss we observe compared to the PTQ engine.
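For reference, this is roughly how a Q/DQ pair can be attached to the output of an UpSampling2D layer with the TensorFlow Model Optimization toolkit (a simplified sketch, not our exact code; base_model and the quantizer settings are placeholders):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer

class UpSamplingQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    # UpSampling2D has no weights, so only an output quantizer is attached;
    # this is what produces the Q/DQ pair after the layer in the exported graph.
    def get_weights_and_quantizers(self, layer):
        return []
    def get_activations_and_quantizers(self, layer):
        return []
    def set_quantize_weights(self, layer, quantize_weights):
        pass
    def set_quantize_activations(self, layer, quantize_activations):
        pass
    def get_output_quantizers(self, layer):
        return [LastValueQuantizer(num_bits=8, symmetric=True,
                                   narrow_range=False, per_axis=False)]
    def get_config(self):
        return {}

def annotate(layer):
    if isinstance(layer, tf.keras.layers.UpSampling2D):
        return tfmot.quantization.keras.quantize_annotate_layer(
            layer, UpSamplingQuantizeConfig())
    return layer

# base_model is the trained FP32 Keras model (placeholder)
annotated = tf.keras.models.clone_model(base_model, clone_function=annotate)
with tfmot.quantization.keras.quantize_scope(
        {'UpSamplingQuantizeConfig': UpSamplingQuantizeConfig}):
    qat_model = tfmot.quantization.keras.quantize_apply(annotated)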
The sub-graph for the QAT model looks like this:
The PTQ case:
, { "name" : "ttfnet/conv2d_6/Conv2D + ttfnet/conv2D_up3/Relu", "timeMs" : 20.2009, "averageMs" : 0.169756, "medianMs" : 0.169472, "percentage" : 0.652824 }
, { "name" : "Resize__1176", "timeMs" : 35.3989, "averageMs" : 0.297469, "medianMs" : 0.29744, "percentage" : 1.14397 }
, { "name" : "PWN(ttfnet/add_2/add)", "timeMs" : 95.5724, "averageMs" : 0.803129, "medianMs" : 0.80304, "percentage" : 3.08857 }
Resize operation IO formats for PTQ:
"LayerType": "Resize",
"Inputs": [
{
"Name": "ttfnet/conv2D_up5/Relu:0",
"Location": "Device",
"Dimensions": [1,128,30,90],
"Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format"
}],
"Outputs": [
{
"Name": "Resize__1044:0",
"Location": "Device",
"Dimensions": [1,128,60,180],
"Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format"
}],
The QAT case:
, { "name" : "ttfnet/quant_conv2d_6/LastValueQuant/QuantizeAndDequantizeV4/Identity:0 + QuantLinearNode__254 + ttfnet/quant_conv2d_6/Conv2D", "timeMs" : 16.9824, "averageMs" : 0.166494, "medianMs" : 0.166512, "percentage" : 0.542475 }
, { "name" : "DequantLinearNode__615", "timeMs" : 39.9534, "averageMs" : 0.3917, "medianMs" : 0.391776, "percentage" : 1.27625 }
, { "name" : "Resize__1828", "timeMs" : 157.379, "averageMs" : 1.54293, "medianMs" : 1.54294, "percentage" : 5.02722 }
, { "name" : "QuantLinearNode__618", "timeMs" : 109.131, "averageMs" : 1.06991, "medianMs" : 1.07011, "percentage" : 3.48602 }
, { "name" : "PWN(ttfnet/quant_add_2/add)", "timeMs" : 76.8304, "averageMs" : 0.753239, "medianMs" : 0.753024, "percentage" : 2.45422 }
Resize operation IO formats for QAT:
"Inputs": [
{
"Name": "ttfnet/quant_up_sampling2d/LastValueQuant/QuantizeAndDequantizeV4:0",
"Location": "Device",
"Dimensions": [1,128,30,90],
"Format/Datatype": "Row major linear FP32"
}],
"Outputs": [
{
"Name": "Resize__1696:0",
"Location": "Device",
"Dimensions": [1,128,60,180],
"Format/Datatype": "Row major linear FP32"
}],
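For completeness, a minimal sketch of dumping this kind of per-layer format information via the TensorRT Python engine inspector (the engine must be built with detailed profiling verbosity; the path is a placeholder):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the already-built engine; detailed profiling verbosity is needed
# at build time for the per-layer format information to be available.
with open("ttfnet_int8.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))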
We have tried the graph both with and without Q/DQ nodes in front of the Resize operation, but it does not help: the Resize is always executed as an FP32 operation.
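A sketch of what the "with Q/DQ in front of Resize" variant looks like, done with onnx-graphsurgeon (simplified, not our exact code; the scale value and file names are placeholders, the real scales come from the trained quantizers):

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("ttfnet_qat.onnx"))

# Per-tensor scale/zero-point; placeholder values for illustration.
scale = gs.Constant("resize_in_scale", np.array(0.0235, dtype=np.float32))
zero_point = gs.Constant("resize_in_zp", np.array(0, dtype=np.int8))

for node in [n for n in graph.nodes if n.op == "Resize"]:
    inp = node.inputs[0]
    q_out = gs.Variable(inp.name + "_quant", dtype=np.int8)
    dq_out = gs.Variable(inp.name + "_dequant", dtype=np.float32)
    q = gs.Node("QuantizeLinear", inputs=[inp, scale, zero_point], outputs=[q_out])
    dq = gs.Node("DequantizeLinear", inputs=[q_out, scale, zero_point], outputs=[dq_out])
    node.inputs[0] = dq_out
    graph.nodes.extend([q, dq])

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "ttfnet_qat_qdq_resize.onnx")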
We have also tried replacing the sub-graph containing the Resize operation with a custom TRT plugin that can perform the up-sampling operation in INT8, HALF and FLOAT formats (the sub-graph looks like the one shown below).
During engine creation the plugin is asked whether it supports various format combinations, including formats matching the output format of the previous layer, but in the final profile we still see the transition to FLOAT. If the plugin reports that the FLOAT data type is not supported, TensorRT refuses to build the engine at all.
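For illustration, the kind of graph surgery involved, replacing each Resize with a node that maps to the custom plugin, looks roughly like this (the plugin op name "UpsamplePlugin_TRT" and its attributes are hypothetical and must match the registered plugin creator):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("ttfnet_qat.onnx"))

for node in [n for n in graph.nodes if n.op == "Resize"]:
    plugin = gs.Node(
        op="UpsamplePlugin_TRT",            # must match the registered plugin name
        name=node.name + "_plugin",
        attrs={"scale_factor": 2, "mode": "nearest"},
        inputs=[node.inputs[0]],            # keep only the data input
        outputs=node.outputs,
    )
    node.outputs = []                       # detach the original Resize
    graph.nodes.append(plugin)

graph.cleanup().toposort()                  # drops the dangling Resize and its constants
onnx.save(gs.export_onnx(graph), "ttfnet_qat_plugin.onnx")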
Obviously something is wrong with the ONNX model/graph, since in PTQ mode it seems that all of these operations can be executed in INT8 precision. So, what is wrong with our model?
Of course, the performance penalty can be reduced by specifying --fp16 in addition to --int8; in that case the operations mentioned above are executed in FP16.
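For reference, enabling both precisions through the TensorRT Python API instead of trtexec looks roughly like this (paths are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("ttfnet_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # explicit Q/DQ quantization from the QAT model
config.set_flag(trt.BuilderFlag.FP16)   # lets layers that cannot run in INT8 fall back to FP16

serialized = builder.build_serialized_network(network, config)
with open("ttfnet_qat_int8_fp16.engine", "wb") as f:
    f.write(serialized)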