How to verify if QAT TRT engine is indeed INT8 on Xavier

I converted a QAT model using trtexec on a Xavier device, but there is no obvious improvement in inference time. I used nvprof to look up which Tensor Core kernels were actually used; the nvprof log is attached. I can’t tell from the kernel names whether they are FP32 or INT8. Can someone take a look and let me know whether I have converted my model to an INT8 engine correctly? Thanks.
qat-log.txt (281.8 KB)

Hi,

To convert a model into an INT8 engine, please make sure you have added the --int8 flag.
Otherwise, the default FP32 mode will be used.

Also, please remember to maximize the device performance before benchmarking.

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

Thanks. I did add the --int8 flag, and there is no speed improvement. I suspect that mistakes made during QAT mean the model can only be converted to an FP32 engine. However, I would like to verify the engine's precision directly rather than rely on circumstantial evidence such as inference speed. How can I do that?
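
One rough way to read a profiler log for this purpose is to bucket kernel names by precision. The sketch below is a heuristic, not an official mapping: the name fragments in PRECISION_TOKENS are assumptions about how TensorRT/cuDNN kernels are typically named on Volta-class GPUs such as Xavier, and they should be checked against the names that actually appear in your own log.

```python
from collections import Counter

# ASSUMPTION: these fragments reflect commonly seen kernel naming, not a
# documented contract. INT8 is checked first because some INT8 kernels
# also carry "fp32" in their names (FP32 output).
PRECISION_TOKENS = [
    ("INT8", ("int8", "i8816", "imma")),
    ("FP16", ("fp16", "h884", "hmma", "hcudnn", "hgemm")),
    ("FP32", ("fp32", "scudnn", "sgemm")),
]


def classify_kernel(name):
    """Return 'INT8', 'FP16', 'FP32' or 'unknown' for one kernel name or log line."""
    lowered = name.lower()
    for precision, tokens in PRECISION_TOKENS:
        if any(token in lowered for token in tokens):
            return precision
    return "unknown"


def summarize_log(lines):
    """Count how many lines of a profiler log fall into each precision bucket."""
    counts = Counter()
    for line in lines:
        precision = classify_kernel(line)
        if precision != "unknown":
            counts[precision] += 1
    return counts
```

Passing the lines of the attached log (for example, an open file handle on qat-log.txt) to summarize_log gives a per-precision line count. If almost everything lands in the FP32 or FP16 buckets, the engine is probably not running most layers in INT8, but a low INT8 count alone does not prove the conversion failed.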

Hi,

Would you mind sharing the model with us so we can check it further?
Thanks.

best.trt (51.6 MB)
Thanks. Here it is. The platform is Xavier with JetPack 4.6.

Hi,

Would you mind sharing the original ONNX file with us?
Thanks.

I have sent it to you through private message. Let me know if you have received it.

Can I assume that little information can be extracted from the engine itself? I’m asking because much NVIDIA hardware does not support INT8 as well as it supports FP16, and some layers even run faster in FP16 than in INT8. So the best QAT setup strategy seems to be to follow the network structure that PTQ optimization produces. Can we learn specifically which layers are quantized and which are not from the PTQ engine?
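
For reference, TensorRT 8.2 and later expose an engine inspector that can report per-layer information from a built engine. The following is a minimal sketch under several assumptions: the engine is deserialized on the same device and TensorRT version that built it, it was built with detailed profiling verbosity (for example trtexec --profilingVerbosity=detailed), since otherwise only layer names are reported, and the JSON layout may differ between versions. The helper names count_precisions and inspect_engine are hypothetical, and the count is deliberately coarse: it tallies every string in the JSON that mentions a precision.

```python
import json


def count_precisions(layer_info):
    """Coarsely count strings in the inspector output that mention each precision."""
    counts = {"Int8": 0, "FP16": 0, "FP32": 0}

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str):
            lowered = node.lower()
            for key in counts:
                if key.lower() in lowered:
                    counts[key] += 1

    walk(layer_info)
    return counts


def inspect_engine(engine_path):
    """Deserialize an engine file and summarize the precisions it reports."""
    import tensorrt as trt  # imported here so count_precisions works without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)  # keep a reference while the engine is in use
    with open(engine_path, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    inspector = engine.create_engine_inspector()
    text = inspector.get_engine_information(trt.LayerInformationFormat.JSON)
    return count_precisions(json.loads(text))
```

Calling inspect_engine("best.trt") on the device would then return a dictionary of counts. A QAT engine that really runs in INT8 should show a substantial share of Int8 tensors. Printing the raw JSON instead of the summary shows which individual layers are quantized.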

@AastaLLL Hello, do we know the cause of the low inference speed now?

@AastaLLL Hello, any updates?

Hi,

We were able to download your model and are now checking it internally.
Will share more information with you later.

Thanks.

@AastaLLL Hi, what have we learnt so far? Were you able to reproduce the issue?

Hi,

We tested your model on Xavier with the latest TensorRT 8.4 (JetPack 5.0.2), and below is the throughput (qps):

$ /usr/src/tensorrt/bin/trtexec --onnx=best.onnx --int8
...
[09/05/2022-07:30:19] [I] === Performance summary ===
[09/05/2022-07:30:19] [I] Throughput: 42.1906 qps
[09/05/2022-07:30:19] [I] Latency: min = 23.4788 ms, max = 23.9562 ms, mean = 23.6918 ms, median = 23.6862 ms, percentile(99%) = 23.9382 ms
[09/05/2022-07:30:19] [I] Enqueue Time: min = 2.93506 ms, max = 4.29883 ms, mean = 4.06823 ms, median = 4.17404 ms, percentile(99%) = 4.25513 ms
[09/05/2022-07:30:19] [I] H2D Latency: min = 0.13208 ms, max = 0.144775 ms, mean = 0.137683 ms, median = 0.137695 ms, percentile(99%) = 0.142853 ms
[09/05/2022-07:30:19] [I] GPU Compute Time: min = 23.2721 ms, max = 23.7436 ms, mean = 23.4812 ms, median = 23.4781 ms, percentile(99%) = 23.7234 ms
[09/05/2022-07:30:19] [I] D2H Latency: min = 0.0644531 ms, max = 0.0783691 ms, mean = 0.0728661 ms, median = 0.0727844 ms, percentile(99%) = 0.0778809 ms
[09/05/2022-07:30:19] [I] Total Host Walltime: 3.03385 s
[09/05/2022-07:30:19] [I] Total GPU Compute Time: 3.0056 s
[09/05/2022-07:30:19] [I] Explanations of the performance metrics are printed in the verbose logs.

We are not sure which framework you are comparing against.
Could you share the performance from your side as well (both TensorRT and third-party library)?

Thanks.

@AastaLLL I compared the INT8 model with the original yolov5l FP16 model. Here are the results. As you can see, the FP16 model is actually faster than the INT8 model. I used JetPack 4.6, so the results should be faster on your device.

I’ve sent you the onnx file. Let me know if you could download it.

[09/05/2022-05:57:21] [I] === Trace details ===
[09/05/2022-05:57:21] [I] Trace averages of 10 runs:
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.128 ms - Host latency: 28.3464 ms (end to end 28.3566 ms, enqueue 11.9572 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.217 ms - Host latency: 28.43 ms (end to end 28.4389 ms, enqueue 11.3588 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.0949 ms - Host latency: 28.3138 ms (end to end 28.324 ms, enqueue 11.9387 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.1479 ms - Host latency: 28.3625 ms (end to end 28.373 ms, enqueue 11.4544 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.3071 ms - Host latency: 28.5275 ms (end to end 28.5706 ms, enqueue 11.9875 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.9723 ms - Host latency: 28.1838 ms (end to end 28.1932 ms, enqueue 11.8031 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.0657 ms - Host latency: 28.2825 ms (end to end 28.29 ms, enqueue 11.2075 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.809 ms - Host latency: 28.0188 ms (end to end 28.028 ms, enqueue 11.7692 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 28.082 ms - Host latency: 28.301 ms (end to end 28.3123 ms, enqueue 11.0856 ms)
[09/05/2022-05:57:21] [I] Average on 10 runs - GPU latency: 27.7799 ms - Host latency: 27.989 ms (end to end 28.0001 ms, enqueue 11.803 ms)
[09/05/2022-05:57:21] [I] 
[09/05/2022-05:57:21] [I] === Performance summary ===
[09/05/2022-05:57:21] [I] Throughput: 35.3304 qps
[09/05/2022-05:57:21] [I] Latency: min = 27.8552 ms, max = 29.8593 ms, mean = 28.2913 ms, median = 28.2172 ms, percentile(99%) = 29.7705 ms
[09/05/2022-05:57:21] [I] End-to-End Host Latency: min = 27.865 ms, max = 29.8698 ms, mean = 28.3042 ms, median = 28.2299 ms, percentile(99%) = 29.7756 ms
[09/05/2022-05:57:21] [I] Enqueue Time: min = 6.81226 ms, max = 13.2745 ms, mean = 11.6344 ms, median = 11.8583 ms, percentile(99%) = 12.2503 ms
[09/05/2022-05:57:21] [I] H2D Latency: min = 0.133667 ms, max = 0.196533 ms, mean = 0.140916 ms, median = 0.138596 ms, percentile(99%) = 0.188202 ms
[09/05/2022-05:57:21] [I] GPU Compute Time: min = 27.6479 ms, max = 29.6464 ms, mean = 28.0755 ms, median = 28.0064 ms, percentile(99%) = 29.5215 ms
[09/05/2022-05:57:21] [I] D2H Latency: min = 0.065918 ms, max = 0.110596 ms, mean = 0.0748275 ms, median = 0.0733948 ms, percentile(99%) = 0.108398 ms
[09/05/2022-05:57:21] [I] Total Host Walltime: 3.05686 s
[09/05/2022-05:57:21] [I] Total GPU Compute Time: 3.03216 s
[09/05/2022-05:57:21] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/05/2022-05:57:21] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --onnx=best.onnx --int8
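
To compare runs like the ones in this thread without reading the logs by eye, the performance summary can be parsed into numbers. This is a small sketch that assumes the summary line format shown in the logs here (a metric name followed by "min = ... ms, max = ... ms, mean = ... ms"); other trtexec versions may print the summary differently. The function name parse_summary is hypothetical.

```python
import re

# Matches lines such as:
# "[I] GPU Compute Time: min = 27.6479 ms, max = 29.6464 ms, mean = 28.0755 ms, ..."
_METRIC = re.compile(
    r"\[I\]\s+(?P<name>[A-Za-z0-9 \-]+):\s+"
    r"min = (?P<min>[\d.]+) ms, max = (?P<max>[\d.]+) ms, mean = (?P<mean>[\d.]+) ms"
)
_THROUGHPUT = re.compile(r"Throughput:\s+([\d.]+) qps")


def parse_summary(log_text):
    """Return throughput (qps) and the mean of each timing metric (ms)."""
    result = {}
    for line in log_text.splitlines():
        match = _THROUGHPUT.search(line)
        if match:
            result["Throughput"] = float(match.group(1))
            continue
        match = _METRIC.search(line)
        if match:
            result[match.group("name").strip()] = float(match.group("mean"))
    return result
```

Parsing two logs this way lets the "GPU Compute Time" means be compared directly, which is the fairest single number for engine speed because it excludes host-side transfer and enqueue overhead. The comparison is only meaningful when both logs come from the same device, JetPack version, and power mode.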

Hi,

So you run both models with TensorRT but the FP16 model is slower, is that correct?

In general, if you want to test the INT8 mode of TensorRT,
please feed it an original (e.g. FP32) model and convert it with the --int8 flag.

Thanks.

Hi @AastaLLL ,
I think numbers show the opposite of that.

The average GPU latency in the first image is around 26 ms for an FP16 engine. This engine was generated from the FP32 yolov5 ONNX model that was sent to you (yolov5l-1214.onnx).

On the other hand, the average GPU latency in the next figure is around 28 ms for an INT8 engine. This INT8 engine was generated from the FP32 yolov5l-QAT model that was also sent to you (best.onnx).

So the INT8 engine generated from quantization-aware training (QAT) was slower than a normal FP16 engine. This is what confuses me.

Thanks.

Hi,

Sorry for the late reply.

Have you checked the INT8 mode with the yolov5 onnx model?
We want to compare the INT8 yolov5 and yolov5l-QAT first.

Thanks.