Performance of QAT YOLOv7 model is worse?

I followed this guide https://github.com/NVIDIA-AI-IOT/yolo_deepstream/tree/main/yolov7_qat to do QAT on a YOLOv7 model. The mAP is good, but the inference time from profiling is bad: the inference time of the int8 QAT engine is about 2x that of the int8 engine built with calibration through the TRT API.
Could you give me some suggestions?

Hi,

Request you to share the model, script, profiler, and performance output if not shared already so that we can help you better.

Alternatively, you can try running your model with trtexec command.

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre and post-processing overhead.
Please refer to the TensorRT documentation on measuring performance for more details.
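For example, a compute-only measurement with trtexec could look like this (a sketch; model.onnx and model.engine are placeholder names):

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --int8 --fp16 --saveEngine=model.engine
/usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --useSpinWait --noDataTransfers --warmUp=500 --duration=10

--noDataTransfers excludes host-device copies, and trtexec performs no pre/post-processing, so the reported latency and throughput reflect the network inference only.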

Thanks!

@AakankshaS
I have 2 models
qat.engine (38.2 MB)
trt_api.engine (37.0 MB)

Info of qat.engine

=== Performance summary ===
[07/26/2023-16:25:09] [I] Throughput: 43.1401 qps
[07/26/2023-16:25:09] [I] Latency: min = 23.7117 ms, max = 26.6441 ms, mean = 24.0691 ms, median = 24.0912 ms, percentile(90%) = 24.2626 ms, percentile(95%) = 24.9028 ms, percentile(99%) = 26.6423 ms

Info of trt_api.engine (the int8 engine generated from YOLOv7 with a calibration file via the TRT API)

=== Performance summary ===
[07/26/2023-16:22:59] [I] Throughput: 29.8922 qps
[07/26/2023-16:22:59] [I] Latency: min = 31.2419 ms, max = 47.0332 ms, mean = 34.2472 ms, median = 32.5824 ms, percentile(90%) = 37.7196 ms, percentile(95%) = 38.7401 ms, percentile(99%) = 47.0332 ms

The inference time of qat.engine is much larger than that of trt_api.engine; the difference is big.

Hi,

Which version of TensorRT are you using? We also recommend that you share environment information such as GPU and CUDA details.
Please try the latest TensorRT version, 8.6, and if you still face the same issue, please share the repro model/commands and complete verbose logs with us.

Thank you.

My GPU is a GTX 1050 Ti, with TensorRT 8.5.1.1, running in a Docker container. The mAP of the QAT engine is better than that of the TRT PTQ engine, but its speed is worse.

Please try the latest TensorRT version, 8.6, and if you still face the same issue, please share the repro model/commands and complete verbose logs with us.

@spolisetty Thank you so much.
I checked with TRT 8.6 on a PC in a Docker container. The performance of the QAT engine is a little bit better (maybe just fluctuation), but it is still bad compared to the engine generated from calibration with the TRT Python API.

[07/28/2023-15:24:48] [I] Throughput: 32.5707 qps
[07/28/2023-15:24:48] [I] Latency: min = 31.2229 ms, max = 32.3268 ms, mean = 31.4741 ms, median = 31.4675 ms, percentile(90%) = 31.5608 ms, percentile(95%) = 31.595 ms, percentile(99%) = 31.6627 ms

As a reminder, in the comment above I attached the two engines built with TRT 8.5.
Here I attach the QAT engine model built with TRT 8.6.
qat_trt86.engine (38.3 MB)

@spolisetty
I am also confused about this performance table: the performance of the TRT PTQ and QAT engines is the same in this table.

I also tried skipping the rules (recommended by NVIDIA) at this line https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/5af35bab7f6dfca7f1f32d44847b2a91786485f4/yolov7_qat/scripts/qat.py#L160 and checked the inference time from profiling. The inference time is almost the same as when the rules are applied.

@spolisetty @mchi
Sorry. Is there any update?

Here is the command to convert the QAT model (qat.pt, exported to ONNX) to an engine:

/usr/src/tensorrt/bin/trtexec --onnx=qat_best_reparam.onnx \
                            --saveEngine=qat_best_reparam_2.engine \
                            --int8 --fp16 --workspace=102400 \
                            --profilingVerbosity=detailed \
                            --useCudaGraph --useSpinWait --noDataTransfers

In this table (from the NVIDIA report), the speed of the TRT PTQ engine and the QAT engine is almost the same. But that is not what I observe after checking on a PC and on a Jetson board.

Hi @johnminho,
In the GitHub link, that perf was tested on Orin X, which has a higher int8/fp16 acceleration rate.

If you want performance the same as PTQ (the best performance), you should fine-tune the Q/DQ placement following this guidance: https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/main/yolov7_qat/doc/Guidance_of_QAT_performance_optimization.md
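A quick way to verify a Q/DQ placement change is to build with detailed layer info and check which layers still run in FP16/FP32 and how many reformat layers were inserted (a sketch; file names are placeholders, and it assumes reformat layers carry "Reformat" in their names, as recent TensorRT versions do):

/usr/src/tensorrt/bin/trtexec --onnx=yolov7_qat.onnx --int8 --fp16 \
                              --profilingVerbosity=detailed \
                              --exportLayerInfo=qat_layer.json \
                              --saveEngine=qat_check.engine
grep -o Reformat qat_layer.json | wc -l

Misplaced Q/DQ nodes break layer fusion and force precision conversions; the extra reformat layers they introduce are the usual reason a QAT engine runs slower than the PTQ one.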

@haowang
Thanks for the response.

That perf was tested on Orin X, which has a higher int8/fp16 acceleration rate.

I did not check on Orin X, but I checked on Xavier NX, and the QAT engine is worse than the PTQ engine. I will check on Orin X.

Hi, would you mind sharing your trtexec log & yolov7_qat_profile.json & yolov7_qat_layer.json here:

trtexec --onnx=yolov7_qat.onnx --fp16 --int8 --verbose \
        --saveEngine=yolov7_qat.engine --workspace=1024000 \
        --warmUp=500 --duration=10 \
        --useCudaGraph --useSpinWait --noDataTransfers \
        --exportLayerInfo=yolov7_qat_layer.json --profilingVerbosity=detailed \
        --exportProfile=yolov7_qat_profile.json
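Sorting the exported profile by per-layer time shows where the QAT engine loses time, for example with jq (a sketch; it assumes the trtexec profile format of a leading count record followed by entries with "name" and "averageMs" fields):

jq -r '.[1:] | sort_by(-.averageMs) | .[0:10][] | "\(.averageMs) ms  \(.name)"' yolov7_qat_profile.json

Comparing this top-10 list between the QAT and PTQ profiles usually points directly at the layers that fell out of INT8.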

Hi @johnminho,
I tried these two models on an A2; the PTQ and QAT perf are almost the same.

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov7_dy.onnx --int8 --best --optShapes=images:12x3x640x640 --saveEngine=yolov7_dy_bs12_best.plan
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_dy_bs12_best.plan --batch=12
→ got 260.109 qps

$ /usr/src/tensorrt/bin/trtexec --onnx=yolov7_qat.onnx --int8 --best --optShapes=images:12x3x640x640 --saveEngine=yolov7_qat_bs12_best.plan
$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_qat_bs12_best.plan --batch=12
→ got 266.843 qps
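One note on the measurement: since these engines are built with explicit batch (--optShapes), the run-time input shape is normally set with --shapes rather than the implicit-batch --batch flag, e.g. (command sketch):

$ /usr/src/tensorrt/bin/trtexec --loadEngine=yolov7_qat_bs12_best.plan --shapes=images:12x3x640x640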

@mchi
Thanks for checking again. Which version of TRT are you using?

@haowang

Hi, would you mind sharing your trtexec log & yolov7_qat_profile.json & yolov7_qat_layer.json here:

Thanks, I will share them later. I will check on several devices and TRT versions; I think the causes may be the hardware and the TRT version. So far I have checked an RTX 2080 Ti with TRT 8.2, and the speed of the QAT engine is bad.
I will inform you as soon as possible.

Please keep in touch, thanks.

Which version of TRT are you using? ==> TensorRT 8.5.3.

Please use the latest TensorRT version, for example TensorRT 8.5.3, to align.
Thanks

@mchi

Which version of TRT are you using? ==> TensorRT 8.5.3.

I am using TRT 8.2.

@haowang

Please use the latest TensorRT version, for example TensorRT 8.5.3, to align.

I am going to check with TRT 8.5.3.

Thank you very much.