Performance of QAT YOLOv7 model is worse?

I followed this guide https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/yolov7_qat to do QAT on a YOLOv7 model. The mAP is good, but the inference time from profiling is bad: the inference time of the int8 QAT engine is about 2x that of the int8 engine built with calibration through the TRT API.
Could you give me some suggestions?

Hi,

Request you to share the model, script, profiler, and performance output if not shared already so that we can help you better.

Alternatively, you can try running your model with trtexec command.

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre and post-processing overhead.
Please refer to the TensorRT documentation on measuring performance for more details.
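For example, a compute-only measurement with trtexec could look like this (a sketch; model.onnx and model.engine are placeholder names):

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8 --fp16 --saveEngine=model.engine
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --useSpinWait --noDataTransfers --warmUp=500 --duration=10

--noDataTransfers excludes host-device copies, and trtexec performs no pre/post-processing, so the reported latency and throughput reflect the network inference only.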

Thanks!

@AakankshaS
I have 2 models
qat.engine (38.2 MB)
trt_api.engine (37.0 MB)

Info of qat.engine

=== Performance summary ===
[07/26/2023-16:25:09] [I] Throughput: 43.1401 qps
[07/26/2023-16:25:09] [I] Latency: min = 23.7117 ms, max = 26.6441 ms, mean = 24.0691 ms, median = 24.0912 ms, percentile(90%) = 24.2626 ms, percentile(95%) = 24.9028 ms, percentile(99%) = 26.6423 ms

Info of trt_api.engine (the int8 engine generated from YOLOv7 with a calibration file via the TRT API)

=== Performance summary ===
[07/26/2023-16:22:59] [I] Throughput: 29.8922 qps
[07/26/2023-16:22:59] [I] Latency: min = 31.2419 ms, max = 47.0332 ms, mean = 34.2472 ms, median = 32.5824 ms, percentile(90%) = 37.7196 ms, percentile(95%) = 38.7401 ms, percentile(99%) = 47.0332 ms

The inference time of qat.engine is much larger than that of trt_api.engine; the difference is big.

Hi,

Which version of TensorRT are you using? We also recommend that you share environment information such as GPU and CUDA details.
Please try the latest TensorRT version, 8.6, and if you still face the same issue, please share the repro model/commands and complete verbose logs with us.

Thank you.

My GPU is a GTX 1050 Ti, with TensorRT 8.5.1.1, running in a Docker container. The mAP of the QAT engine is better than that of the TRT PTQ engine, but its speed is worse.

Please try the latest TensorRT version, 8.6, and if you still face the same issue, please share the repro model/commands and complete verbose logs with us.

@spolisetty Thank you so much.
I checked with TRT 8.6 on a PC in a Docker container. The performance of the QAT engine is a little bit better (maybe just fluctuation), but it is still bad compared to the engine generated from calibration with the TRT Python API.

[07/28/2023-15:24:48] [I] Throughput: 32.5707 qps
[07/28/2023-15:24:48] [I] Latency: min = 31.2229 ms, max = 32.3268 ms, mean = 31.4741 ms, median = 31.4675 ms, percentile(90%) = 31.5608 ms, percentile(95%) = 31.595 ms, percentile(99%) = 31.6627 ms

As a reminder, in the comment above I attached the two engines built with TRT 8.5.
Here I attach the QAT engine model built with TRT 8.6.
qat_trt86.engine (38.3 MB)

@spolisetty
I am also confused about this performance table: the performance of the TRT PTQ and QAT engines is the same in this table.

I also tried skipping the rules (recommended by NVIDIA) at this line https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/5af35bab7f6dfca7f1f32d44847b2a91786485f4/yolov7_qat/scripts/qat.py#L160 and checked the inference time from profiling. The inference time is almost the same as when the rules are applied.

@spolisetty @mchi
Sorry. Is there any update?

Here is the command to convert the QAT model (qat.pt, exported to ONNX) to an engine:

/usr/src/tensorrt/bin/trtexec --onnx=qat_best_reparam.onnx \
                            --saveEngine=qat_best_reparam_2.engine \
                            --int8 --fp16 --workspace=102400 \
                            --profilingVerbosity=detailed \
                            --useCudaGraph --useSpinWait --noDataTransfers

In this table (from the NVIDIA report), the speed of the TRT PTQ engine and the QAT engine is almost the same. But that is not what I observe after checking on a PC and on a Jetson board.

Hi @johnminho,
In the GitHub link, that perf was tested on Orin X, which has a higher int8/fp16 acceleration rate.

If you want performance the same as PTQ (the best performance), you should fine-tune the Q/DQ placement following this guidance: https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/main/yolov7_qat/doc/Guidance_of_QAT_performance_optimization.md
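A quick way to verify a Q/DQ placement change is to build with detailed layer info and check which layers still run in FP16/FP32 and how many reformat layers were inserted (a sketch; file names are placeholders, and it assumes reformat layers carry "Reformat" in their names, as recent TensorRT versions do):

/usr/src/tensorrt/bin/trtexec --onnx=yolov7_qat.onnx --int8 --fp16 \
                              --profilingVerbosity=detailed \
                              --exportLayerInfo=qat_layer.json \
                              --saveEngine=qat_check.engine
grep -o Reformat qat_layer.json | wc -l

Misplaced Q/DQ nodes break layer fusion and force precision conversions; the extra reformat layers they introduce are the usual reason a QAT engine runs slower than the PTQ one.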

@haowang
Thanks for the response.

That perf was tested on Orin X, which has a higher int8/fp16 acceleration rate.

I did not check on Orin X, but I checked on Xavier NX, and the QAT engine is worse than the PTQ engine. I will check on Orin X.

Hi, would you mind sharing your trtexec log & yolov7_qat_profile.json & yolov7_qat_layer.json here:

trtexec --onnx=yolov7_qat.onnx --fp16 --int8 --verbose \
        --saveEngine=yolov7_qat.engine --workspace=1024000 \
        --warmUp=500 --duration=10 \
        --useCudaGraph --useSpinWait --noDataTransfers \
        --exportLayerInfo=yolov7_qat_layer.json --profilingVerbosity=detailed \
        --exportProfile=yolov7_qat_profile.json
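Sorting the exported profile by per-layer time shows where the QAT engine loses time, for example with jq (a sketch; it assumes the trtexec profile format of a leading count record followed by entries with "name" and "averageMs" fields):

jq -r '.[1:] | sort_by(-.averageMs) | .[0:10][] | "\(.averageMs) ms  \(.name)"' yolov7_qat_profile.json

Comparing this top-10 list between the QAT and PTQ profiles usually points directly at the layers that fell out of INT8.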

Hi @johnminho,
I tried these two models on an A2; the PTQ and QAT perf are almost the same.

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov7_dy.onnx --int8 --best --optShapes=images:12x3x640x640 --saveEngine=yolov7_dy_bs12_best.plan
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_dy_bs12_best.plan --batch=12
→ got 260.109 qps

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov7_qat.onnx --int8 --best --optShapes=images:12x3x640x640 --saveEngine=yolov7_qat_bs12_best.plan
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_qat_bs12_best.plan --batch=12
→ got 266.843 qps
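One note on the measurement: since these engines are built with explicit batch (--optShapes), the run-time input shape is normally set with --shapes rather than the implicit-batch --batch flag, e.g. (command sketch):

$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_qat_bs12_best.plan --shapes=images:12x3x640x640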

@mchi
Thanks for checking again. Which version of TRT are you using?

@haowang

Hi, would you mind sharing your trtexec log & yolov7_qat_profile.json & yolov7_qat_layer.json here:

Thanks, I will share them later. I will check on several devices and TRT versions; I think the causes may be the hardware and the TRT version. So far I have checked an RTX 2080 Ti with TRT 8.2, and the speed of the QAT engine is bad.
I will inform you as soon as possible.

Please keep in touch, thanks.

Which version of TRT are you using? ==> TensorRT 8.5.3.

Please use the latest TensorRT version, for example TensorRT 8.5.3, to align.
Thanks

@mchi

Which version of TRT are you using? ==> TensorRT 8.5.3.

I am using TRT 8.2.

@haowang

Please use the latest TensorRT version, for example TensorRT 8.5.3, to align.

I am going to check with TRT 8.5.3.

Thank you very much.