TensorRT inference is slower for the QAT model compared to the PTQ case

We trained an object detection model with the TensorFlow/Keras framework in FP32 precision and then performed PTQ on a calibration dataset.
We got a TRT engine with good inference speed, but the accuracy degraded significantly, so we decided to perform QAT training.
After a lot of refactoring we obtained a final INT8 model with accuracy comparable to the FP32 model (sometimes even better), but the engine produced by TRT is considerably slower than the engine generated using PTQ. Using TRT profiling we found that a lot of operations are still being executed in FP32, so we investigated all of those cases and, wherever possible, added Q/DQ nodes or changed the graph to avoid transitions to FP32.
We found that all remaining speed differences between the QAT and PTQ engines are caused by UpSampling2D layers (translated to the ONNX Resize operation), which upscale the input tensor in “nearest” mode. In the PTQ engine these layers are executed in INT8 precision, but in the QAT engine their input tensors are always dequantized to FP32 and then quantized back to INT8. From the profiling log we see that the total cost of the extra DQ/Q nodes kept in the graph approximately corresponds to the performance loss we observe compared to the PTQ engine.
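
(For reference, a check of this kind can be scripted roughly as follows; this is only a sketch, not the exact tooling used here, and the file name is just an example.)

# Minimal sketch: for each Resize node, print the producer of its data input
# and the consumers of its output, to spot DQ -> Resize -> Q patterns
# (i.e. an FP32 island around the up-sampling). File name is an example only.
import onnx

model = onnx.load("model_qat.onnx")  # hypothetical path
nodes = model.graph.node

# tensor name -> producing node / consuming nodes
producers = {out: n for n in nodes for out in n.output}
consumers = {}
for n in nodes:
    for inp in n.input:
        consumers.setdefault(inp, []).append(n)

for n in nodes:
    if n.op_type != "Resize":
        continue
    prev = producers.get(n.input[0])
    nexts = consumers.get(n.output[0], [])
    print(n.name,
          "| input from:", prev.op_type if prev else "graph input",
          "| output to:", [c.op_type for c in nexts])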

The sub-graph for the QAT model looks like this:

The PTQ case:

, { "name" : "ttfnet/conv2d_6/Conv2D + ttfnet/conv2D_up3/Relu", "timeMs" : 20.2009, "averageMs" : 0.169756, "medianMs" : 0.169472, "percentage" : 0.652824 }
, { "name" : "Resize__1176", "timeMs" : 35.3989, "averageMs" : 0.297469, "medianMs" : 0.29744, "percentage" : 1.14397 }
, { "name" : "PWN(ttfnet/add_2/add)", "timeMs" : 95.5724, "averageMs" : 0.803129, "medianMs" : 0.80304, "percentage" : 3.08857 }

Resize operation IO formats for PTQ:

  "LayerType": "Resize",
  "Inputs": [
  {
    "Name": "ttfnet/conv2D_up5/Relu:0",
    "Location": "Device",
    "Dimensions": [1,128,30,90],
    "Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format"
  }],
  "Outputs": [
  {
    "Name": "Resize__1044:0",
    "Location": "Device",
    "Dimensions": [1,128,60,180],
    "Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format"
  }],

The QAT case:

, { "name" : "ttfnet/quant_conv2d_6/LastValueQuant/QuantizeAndDequantizeV4/Identity:0 + QuantLinearNode__254 + ttfnet/quant_conv2d_6/Conv2D", "timeMs" : 16.9824, "averageMs" : 0.166494, "medianMs" : 0.166512, "percentage" : 0.542475 }
, { "name" : "DequantLinearNode__615", "timeMs" : 39.9534, "averageMs" : 0.3917, "medianMs" : 0.391776, "percentage" : 1.27625 }
, { "name" : "Resize__1828", "timeMs" : 157.379, "averageMs" : 1.54293, "medianMs" : 1.54294, "percentage" : 5.02722 }
, { "name" : "QuantLinearNode__618", "timeMs" : 109.131, "averageMs" : 1.06991, "medianMs" : 1.07011, "percentage" : 3.48602 }
, { "name" : "PWN(ttfnet/quant_add_2/add)", "timeMs" : 76.8304, "averageMs" : 0.753239, "medianMs" : 0.753024, "percentage" : 2.45422 }

Resize operation IO formats for QAT:

  "Inputs": [
  {
    "Name": "ttfnet/quant_up_sampling2d/LastValueQuant/QuantizeAndDequantizeV4:0",
    "Location": "Device",
    "Dimensions": [1,128,30,90],
    "Format/Datatype": "Row major linear FP32"
  }],
  "Outputs": [
  {
    "Name": "Resize__1696:0",
    "Location": "Device",
    "Dimensions": [1,128,60,180],
    "Format/Datatype": "Row major linear FP32"
  }],

We have tried the graph both with and without Q/DQ nodes in front of the Resize operation, but it does not help: the Resize is always executed as an FP operation.
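
(For reference, the “without Q/DQ in front of the Resize” variant can be produced with a small onnx-graphsurgeon script roughly like the sketch below; the file names are only examples, and bypassing a Q/DQ pair changes the quantization recipe, so accuracy has to be rechecked.)

# Sketch: bypass an explicit QuantizeLinear -> DequantizeLinear pair that feeds
# a Resize node, to compare engines built with and without Q/DQ before the
# up-sampling. File names are examples only; accuracy must be re-validated.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model_qat.onnx"))

for node in graph.nodes:
    if node.op != "Resize":
        continue
    dq = node.inputs[0].inputs[0] if node.inputs[0].inputs else None
    if dq is None or dq.op != "DequantizeLinear":
        continue
    q = dq.inputs[0].inputs[0] if dq.inputs[0].inputs else None
    if q is None or q.op != "QuantizeLinear":
        continue
    # Feed the Resize directly from the tensor that entered the Q/DQ pair.
    node.inputs[0] = q.inputs[0]

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_qat_no_qdq_before_resize.onnx")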
We have also tried to replace the sub-graph containing the Resize operation with a custom TRT plugin that can perform the up-sampling in INT8, HALF and FLOAT formats (the sub-graph looks like the one shown below).

[Screenshot from 2022-11-14 20-00-08: sub-graph with the custom up-sampling plugin]

During engine creation the plugin is asked whether it supports various format combinations, including formats that match the output format of the previous layer, but in the final profile we still see the transition to FLOAT. If the plugin reports that the FLOAT data type is not supported, TensorRT refuses to build the engine at all.

Obviously there is something wrong with the ONNX model/graph, because in PTQ mode it seems that all operations can be executed in INT8 precision. So, what is wrong with our model?
Of course, the performance penalty can be reduced by specifying --fp16 in addition to --int8; in that case the operations mentioned above are performed in FP16.
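
(For completeness, the Python-API equivalent of building with --int8 plus --fp16 looks roughly like the sketch below; file names are only examples.)

# Sketch: build the QAT (Q/DQ) ONNX with INT8 enabled and FP16 as a fallback,
# i.e. roughly what trtexec does with --onnx=model_qat.onnx --int8 --fp16.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model_qat.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # required for the Q/DQ graph
config.set_flag(trt.BuilderFlag.FP16)   # layers that refuse INT8 may fall back to FP16
serialized = builder.build_serialized_network(network, config)
with open("model_qat.engine", "wb") as f:
    f.write(serialized)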


Hi,

Below is a related document for your reference:

Have you tried converting the FP32 model with the --int8 flag, which lets TensorRT do the quantization directly?
Or have you tried this Quantization Toolkit?

Thanks.

Hi,

Yes, certainly, we did this “direct quantization” experiment; I presented its results as the PTQ case. When I wrote “PTQ on a calibration dataset”, I meant that we performed Post-Training Quantization, i.e. “direct quantization”, using the calibration samples. So in the PTQ (“direct quantization”) case all nodes are executed in INT8 precision, but in the QAT (Quantization-Aware Training) case the Resize operation, or the custom plugin operation, is still executed as an FP operation surrounded by unnecessary DQ/Q nodes. Of course, to build the QAT network we used the Quantization Toolkit, with the additional changes needed to quantize our network architecture. So the question is still the same: from the execution point of view both graphs should be identical, so why are they not?
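
(To make the terminology concrete: by PTQ / “direct quantization” I mean building the FP32 ONNX with --int8 plus a calibrator that feeds the calibration samples. A rough sketch of such a calibrator with the TensorRT Python API is shown below; the class name, batch handling and file names are examples only, not our actual code.)

# Sketch of an INT8 entropy calibrator for PTQ ("direct quantization").
# TensorRT calls get_batch() repeatedly and derives the INT8 scales itself.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)      # iterable of NCHW float32 arrays
        self.cache_file = cache_file
        self.device_mem = None

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                   # no more calibration data
        if self.device_mem is None:
            self.device_mem = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_mem, batch)
        return [int(self.device_mem)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Usage with a builder config that has the INT8 flag set:
#   config.int8_calibrator = Calibrator(my_calibration_batches)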
I have tried different TensorRT versions (8.2, 8.4 and 8.5), but the result is the same in all cases. Since we need this network to run on a Xavier device, TensorRT 8.5 is not an option for us anyway.


Dear @AastaLLL,
As far as I understand, @alex379 performed the following steps:

  1. Trained the TF2/Keras network in FP32 mode and generated a TRT engine with the --int8 key, finding that all operations in the computation graph are executed in INT8 mode.

  2. Then modified the same network with TensorRT/tools/tensorflow-quantization at main · NVIDIA/TensorRT · GitHub, trained it again, and generated a new engine (with the --int8 key, of course).
    The new engine contains FP32 operations at the Resize layers, which significantly slows down inference.

The question is: why is this so, and is it possible to create a TRT engine from a TF2/Keras model whose Resize layers use INT8 operations only, after quantization-aware training?
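
(For what it is worth, a quick way to check which precision and format each layer of a built engine actually uses is the engine inspector, roughly as sketched below; it needs TensorRT 8.2 or newer and an engine built with DETAILED profiling verbosity, and the file name is only an example.)

# Sketch: dump per-layer information (including precisions/formats) from a
# built engine via the engine inspector (TensorRT >= 8.2).
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
with open("model_qat.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))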

Hi both,

Sorry for the late update and thanks for your patience.

Just want to confirm first.
Did you use --int8 to generate the TensorRT engine for the QAT model?

Also, it sounds like this issue is not specific to Jetson but to TensorRT in general. Is that correct?

If yes, could you share the model with and without QAT with us?
(a subgraph with the same architecture should be good)
We want to check this issue with our internal team to get more information.

Thanks.

Hi,

Thank you for response!

Yes, we specified the --int8 option to generate the TensorRT engine (otherwise trtexec fails with the following error message):

[E] Error[4]: [network.cpp::validate::2830] Error Code 4: Internal Error (Int8 precision has been set for a layer or layer output, but int8 is not configured in the builder)

I agree that this behaviour does not look Jetson-specific; we see the same behaviour with the PC version of TensorRT (we have tried different TensorRT versions: 8.2, 8.4 and 8.5).

I have attached both models (FP32 and QAT trained):
model_fp32.onnx (2.8 MB)
model_qat.onnx (3.0 MB)

I have also attached the profile information generated by the trtexec utility:
metadata_fp32.json (86.3 KB)
metadata_qat.json (97.5 KB)
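
(If it helps, the per-layer precisions in these two files can be tallied with a small script like the sketch below, assuming the "Layers" layout shown in the snippets above.)

# Sketch: count how many layer inputs/outputs use each Format/Datatype in the
# layer-information JSON exported by trtexec.
import json
from collections import Counter

def tally(path):
    with open(path) as f:
        layers = json.load(f)["Layers"]
    counts = Counter()
    for layer in layers:
        if not isinstance(layer, dict):
            continue  # some verbosity levels export plain strings instead
        for port in layer.get("Inputs", []) + layer.get("Outputs", []):
            counts[port.get("Format/Datatype", "unknown")] += 1
    return counts

print("FP32/PTQ:", tally("metadata_fp32.json"))
print("QAT:", tally("metadata_qat.json"))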

Thanks.

Hi,

Thanks for the details.
We are checking this issue with our internal team and will share the feedback with you soon.

Just want to confirm again: did you use our Quantization Toolkit to do the QAT training?

Thanks.

Hi,

Yes, we have used the Quantization Toolkit (tensorflow version) to do the QAT training.

Thanks.

Thanks for the confirmation.

We are discussing this issue with our internal team.
Will share more information with you later.