How to verify if QAT TRT engine is indeed INT8 on Xavier

srsjd · August 26, 2022, 10:59am

I converted a QAT model with trtexec on a Xavier device, but there is no obvious improvement in inference time. I used nvprof to look up which Tensor Core kernels were actually used. The nvprof log is attached. I can’t quite tell from the kernel names whether they are FP32 or INT8 Tensor Core kernels. Can someone take a look and let me know whether I have converted my model to an INT8 engine correctly? Thanks.
qat-log.txt (281.8 KB)

AastaLLL · August 29, 2022, 2:34am

Hi,

To convert a model into an INT8 engine, please make sure you have added the --int8 flag.
Otherwise, the default mode, FP32, will be used.

Also, please remember to maximize the device performance before benchmarking.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

srsjd · August 29, 2022, 3:02am

Thanks. I did add the --int8 flag, and there is no speed improvement. I think mistakes may have been made during QAT, with the result that the model can only be converted to an FP32 engine. However, I would like to verify the engine's precision directly instead of relying on circumstantial evidence such as inference speed. How can I do that?
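One rough way to read precision from a profiler log is to classify kernel names by substring. The sketch below is only a heuristic: the hint substrings (for example `i8816` for INT8 and `h884` for FP16 Tensor Core kernels on Volta-class GPUs) are assumptions based on commonly seen TensorRT/cuDNN kernel names, not an official naming contract, and the sample names are illustrative rather than taken from the attached qat-log.txt.

```python
from collections import Counter

# Heuristic substrings per precision. These are assumptions drawn from
# commonly seen kernel names, not a documented NVIDIA naming scheme.
# Order matters: INT8 is checked first because some INT8 kernels also
# mention "fp32" (for their output type) in the name.
PRECISION_HINTS = [
    ("int8", ("int8", "i8816", "imma")),
    ("fp16", ("h884", "h1688", "hmma", "fp16")),
    ("fp32", ("scudnn", "sgemm", "fp32")),
]


def classify_kernel(name):
    """Return a best-guess precision label for one kernel name."""
    lowered = name.lower()
    for precision, hints in PRECISION_HINTS:
        if any(hint in lowered for hint in hints):
            return precision
    return "unknown"


def summarize(kernel_names):
    """Count how many kernel names fall into each precision bucket."""
    return Counter(classify_kernel(name) for name in kernel_names)


# Illustrative kernel names only (hypothetical input, not from the thread).
SAMPLE_KERNELS = [
    "trt_volta_int8_i8816cudnn_int8_256x64_ldg16_relu_small_nt_v1",
    "trt_volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1",
    "trt_volta_scudnn_128x64_relu_small_nn_v1",
    "copyPackedKernel",
]

if __name__ == "__main__":
    for precision, count in summarize(SAMPLE_KERNELS).items():
        print(f"{precision}: {count}")
```

To apply this to a real log, the kernel-name column would first have to be extracted from the profiler output. A count dominated by the "fp16" or "fp32" buckets would suggest that many layers fell back from INT8, though the counts say nothing about how much time each kernel took.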

AastaLLL · August 29, 2022, 3:06am

Hi,

Would you mind sharing the model with us so we can check it further?
Thanks.

srsjd · August 29, 2022, 3:11am

best.trt (51.6 MB)
Thanks. Here it is. The platform is Xavier with JetPack 4.6.

AastaLLL · August 29, 2022, 8:51am

Hi,

Would you mind sharing the original ONNX file with us?
Thanks.

srsjd · August 29, 2022, 9:47am

I have sent it to you through private message. Let me know if you have received it.

Can I assume that little information can be extracted from the engine? I’m asking because much NVIDIA hardware does not support INT8 as well as it supports FP16, and some layers even run faster in FP16 than in INT8. So the best QAT setup strategy seems to be to follow the PTQ-optimized network structure. Can we learn specifically which layers are quantized and which are not from the PTQ engine?

srsjd · August 30, 2022, 7:22am

@AastaLLL Hello, do we know the cause of the low inference speed now?

srsjd · September 1, 2022, 1:57am

@AastaLLL Hello, any updates?

AastaLLL · September 1, 2022, 6:34am

Hi,

We were able to download your model and are now checking it internally.
We will share more information with you later.

Thanks.

srsjd · September 2, 2022, 8:57am

@AastaLLL Hi, what have we learnt so far? Were you able to reproduce the issue?

AastaLLL · September 5, 2022, 7:34am

Hi,

We tested your model on Xavier with the latest TensorRT 8.4 (JetPack 5.0.2), and below is the throughput (qps):

$ /usr/src/tensorrt/bin/trtexec --onnx=best.onnx --int8
...
[09/05/2022-07:30:19] [I] === Performance summary ===
[09/05/2022-07:30:19] [I] Throughput: 42.1906 qps
[09/05/2022-07:30:19] [I] Latency: min = 23.4788 ms, max = 23.9562 ms, mean = 23.6918 ms, median = 23.6862 ms, percentile(99%) = 23.9382 ms
[09/05/2022-07:30:19] [I] Enqueue Time: min = 2.93506 ms, max = 4.29883 ms, mean = 4.06823 ms, median = 4.17404 ms, percentile(99%) = 4.25513 ms
[09/05/2022-07:30:19] [I] H2D Latency: min = 0.13208 ms, max = 0.144775 ms, mean = 0.137683 ms, median = 0.137695 ms, percentile(99%) = 0.142853 ms
[09/05/2022-07:30:19] [I] GPU Compute Time: min = 23.2721 ms, max = 23.7436 ms, mean = 23.4812 ms, median = 23.4781 ms, percentile(99%) = 23.7234 ms
[09/05/2022-07:30:19] [I] D2H Latency: min = 0.0644531 ms, max = 0.0783691 ms, mean = 0.0728661 ms, median = 0.0727844 ms, percentile(99%) = 0.0778809 ms
[09/05/2022-07:30:19] [I] Total Host Walltime: 3.03385 s
[09/05/2022-07:30:19] [I] Total GPU Compute Time: 3.0056 s
[09/05/2022-07:30:19] [I] Explanations of the performance metrics are printed in the verbose logs.

We are not sure which framework you are comparing against.
Could you share the performance from your side as well (both TensorRT and the third-party library)?

Thanks.

srsjd · September 5, 2022, 9:35am

@AastaLLL I compared the INT8 model with the original yolov5l FP16 model. Here are the results. As you can see, the FP16 model is actually faster than the INT8 model. I used JetPack 4.6, so results should be faster on your device.

I’ve sent you the onnx file. Let me know if you could download it.

[09/05/2022-05:57:21] [I] === Trace details ===
[09/05/2022-05:57:21] [I] Trace averages of 10 runs:
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.128 ms - Host latency: 28.3464 ms (end to end 28.3566 ms, enqueue 11.9572 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.217 ms - Host latency: 28.43 ms (end to end 28.4389 ms, enqueue 11.3588 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.0949 ms - Host latency: 28.3138 ms (end to end 28.324 ms, enqueue 11.9387 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.1479 ms - Host latency: 28.3625 ms (end to end 28.373 ms, enqueue 11.4544 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.3071 ms - Host latency: 28.5275 ms (end to end 28.5706 ms, enqueue 11.9875 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.9723 ms - Host latency: 28.1838 ms (end to end 28.1932 ms, enqueue 11.8031 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.0657 ms - Host latency: 28.2825 ms (end to end 28.29 ms, enqueue 11.2075 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.809 ms - Host latency: 28.0188 ms (end to end 28.028 ms, enqueue 11.7692 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.082 ms - Host latency: 28.301 ms (end to end 28.3123 ms, enqueue 11.0856 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.7799 ms - Host latency: 27.989 ms (end to end 28.0001 ms, enqueue 11.803 ms)
[09/05/2022-05:57:21] [I] 
[09/05/2022-05:57:21] [I] === Performance summary ===
[09/05/2022-05:57:21] [I] Throughput: 35.3304 qps
[09/05/2022-05:57:21] [I] Latency: min = 27.8552 ms, max = 29.8593 ms, mean = 28.2913 ms, median = 28.2172 ms, percentile(99%) = 29.7705 ms
[09/05/2022-05:57:21] [I] End-to-End Host Latency: min = 27.865 ms, max = 29.8698 ms, mean = 28.3042 ms, median = 28.2299 ms, percentile(99%) = 29.7756 ms
[09/05/2022-05:57:21] [I] Enqueue Time: min = 6.81226 ms, max = 13.2745 ms, mean = 11.6344 ms, median = 11.8583 ms, percentile(99%) = 12.2503 ms
[09/05/2022-05:57:21] [I] H2D Latency: min = 0.133667 ms, max = 0.196533 ms, mean = 0.140916 ms, median = 0.138596 ms, percentile(99%) = 0.188202 ms
[09/05/2022-05:57:21] [I] GPU Compute Time: min = 27.6479 ms, max = 29.6464 ms, mean = 28.0755 ms, median = 28.0064 ms, percentile(99%) = 29.5215 ms
[09/05/2022-05:57:21] [I] D2H Latency: min = 0.065918 ms, max = 0.110596 ms, mean = 0.0748275 ms, median = 0.0733948 ms, percentile(99%) = 0.108398 ms
[09/05/2022-05:57:21] [I] Total Host Walltime: 3.05686 s
[09/05/2022-05:57:21] [I] Total GPU Compute Time: 3.03216 s
[09/05/2022-05:57:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/05/2022-05:57:21] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=best.onnx --int8
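The two trtexec summaries posted in this thread for best.onnx with --int8 can be compared numerically by extracting the mean "GPU Compute Time" from each. The following is a minimal sketch; the function name `mean_gpu_compute_ms` is hypothetical, and the two input lines are copied from the logs above. Because the runs used different JetPack and TensorRT versions, the ratio reflects the software stack as much as the model.

```python
import re

# Matches the per-inference summary line, e.g.
# "GPU Compute Time: min = ... ms, max = ... ms, mean = 23.4812 ms, ..."
# The "Total GPU Compute Time" line has no "mean =" field, so it is skipped.
MEAN_RE = re.compile(r"GPU Compute Time:.*?mean = ([0-9.]+) ms")


def mean_gpu_compute_ms(log_text):
    """Return the mean GPU compute time in ms from trtexec summary text."""
    match = MEAN_RE.search(log_text)
    if match is None:
        raise ValueError("no 'GPU Compute Time' summary line found")
    return float(match.group(1))


# Summary lines copied from the two runs in this thread.
JETPACK_502_LOG = (
    "[09/05/2022-07:30:19] [I] GPU Compute Time: min = 23.2721 ms, "
    "max = 23.7436 ms, mean = 23.4812 ms, median = 23.4781 ms, "
    "percentile(99%) = 23.7234 ms"
)
JETPACK_46_LOG = (
    "[09/05/2022-05:57:21] [I] GPU Compute Time: min = 27.6479 ms, "
    "max = 29.6464 ms, mean = 28.0755 ms, median = 28.0064 ms, "
    "percentile(99%) = 29.5215 ms"
)

if __name__ == "__main__":
    newer = mean_gpu_compute_ms(JETPACK_502_LOG)
    older = mean_gpu_compute_ms(JETPACK_46_LOG)
    print(f"JetPack 5.0.2: {newer} ms, JetPack 4.6: {older} ms")
    print(f"ratio (4.6 / 5.0.2): {older / newer:.2f}")
```

The same function could be pointed at a saved FP16 trtexec log to compare FP16 and INT8 engines built on the same device and software version, which would be the fairer comparison.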

AastaLLL · September 6, 2022, 4:31am

Hi,

So you ran both models with TensorRT but the FP16 model is slower, is that correct?

In general, if you want to test the INT8 mode of TensorRT,
please feed it an original (e.g. FP32) model and convert it with the --int8 flag.

Thanks.

srsjd · September 13, 2022, 4:22am

Hi @AastaLLL ,
I think the numbers show the opposite.

The average GPU latency in the first image is around 26 ms for an FP16 engine. This engine was generated from an FP32 yolov5 ONNX model that was sent to you (yolov5l-1214.onnx).

On the other hand, the average GPU latency in the next figure is around 28 ms for an INT8 model. This INT8 model was generated from an FP32 yolov5l-QAT model that was also sent to you (best.onnx).

So the INT8 engine generated from quantization-aware training (QAT) was slower than a normal FP16 engine. This is what confuses me.

Thanks.

AastaLLL · September 22, 2022, 3:57am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

Sorry for the late reply.

Have you checked the INT8 mode with the yolov5 onnx model?
We want to compare the INT8 yolov5 and yolov5l-QAT first.

Thanks.
