How can we know that we have converted the ONNX model to an INT8 TensorRT engine rather than FP32?

Hi @530869411,

The input/output bindings are still FP32 even for an INT8 engine unless reformat-free I/O is used, so the above code cannot be used to check for INT8.
Also, there is no INT8 support on Jetson Nano, since it uses a Maxwell GPU.
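For reference, a minimal sketch (assuming an already deserialized engine from the TensorRT Python API) of printing the binding data types. As noted above, this cannot confirm INT8, because the input/output bindings still report FP32 even when the internal layers run in INT8:

import tensorrt as trt

def print_binding_dtypes(engine: "trt.ICudaEngine") -> None:
    # Binding dtypes describe only the engine inputs/outputs, so they remain
    # FP32 for an INT8 engine unless reformat-free I/O was requested at build time.
    for i in range(engine.num_bindings):
        print(engine.get_binding_name(i), engine.get_binding_dtype(i))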

Thank you.

@spolisetty Now I use the following code to create an FP16 model:
onnx_to_tensorrt.py (2.8 KB)
and the inference code
inference-10.py (5.5 KB)
but I just found that FP16 is slower than FP32. Can you help me?

Hi @530869411,

Could you please share the issue-reproducible ONNX models along with the scripts/steps, so that we can try from our end and assist you better.

Meanwhile, as an alternative, you can try running your model with the trtexec command and share the --verbose logs with us.
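For example (the model file name is a placeholder), verbose FP32 and FP16 benchmark runs look like:

trtexec --onnx=<model>.onnx --verbose
trtexec --onnx=<model>.onnx --fp16 --verbose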

Thank you.

@spolisetty
convert_to_onnx.py (505 Bytes)
F_TorrentNet.pth (2.0 MB)
F_TorrentNet.onnx (2.0 MB)
verbose.txt (45.1 KB)

@spolisetty @NVES

Hi @530869411,

It looks like you have shared a single ONNX file. Please share the other model as well, so that we can compare the performance of both.

Thank you.

I only have the ONNX (FP32) model, and I want to convert it to an FP16 TensorRT engine through code. When the conversion succeeds, I find the FP16 engine is slower than the FP32 engine.
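(Since onnx_to_tensorrt.py itself is not reproduced in this thread, here is a minimal sketch of this kind of FP32-ONNX-to-FP16-engine conversion using the TensorRT 7.x-style Python API; the workspace size and output file name are illustrative:)

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def build_fp16_engine(onnx_path: str, engine_path: str) -> None:
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(EXPLICIT_BATCH)
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the FP32 ONNX model.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse " + onnx_path)

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 28        # 256 MiB scratch space, adjust as needed
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels; TRT may still pick FP32 ones

    engine = builder.build_engine(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine.serialize())

build_fp16_engine("F_TorrentNet.onnx", "F_TorrentNet_fp16.trt")  # output name is illustrative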

F_Let.pth (276.3 KB)
F_Let_fp16.trt (156.1 KB)

Hi @530869411,

We are unable to reproduce this issue; in our run FP16 is faster than FP32. Please share the trtexec --verbose logs of both builds for better debugging. FP16 should never be slower than FP32: in the worst case (i.e., all FP16 kernels are worse than the FP32 kernels), TensorRT simply falls back to FP32.

For your reference, we observe an FP32 mean latency of 0.160975 ms versus 0.136759 ms for FP16, which shows FP16 is faster.

fp32

[05/26/2021-17:30:49] [I] Host Latency
[05/26/2021-17:30:49] [I] min: 0.152832 ms (end to end 0.163513 ms)
[05/26/2021-17:30:49] [I] max: 4.63269 ms (end to end 4.64197 ms)
[05/26/2021-17:30:49] [I] mean: 0.160975 ms (end to end 0.169979 ms)
[05/26/2021-17:30:49] [I] median: 0.158691 ms (end to end 0.16748 ms)
[05/26/2021-17:30:49] [I] percentile: 0.174606 ms at 99% (end to end 0.184784 ms at 99%)
[05/26/2021-17:30:49] [I] throughput: 5181.99 qps
[05/26/2021-17:30:49] [I] walltime: 3.00039 s
[05/26/2021-17:30:49] [I] Enqueue Time
[05/26/2021-17:30:49] [I] min: 0.143646 ms
[05/26/2021-17:30:49] [I] max: 4.62085 ms
[05/26/2021-17:30:49] [I] median: 0.147705 ms
[05/26/2021-17:30:49] [I] GPU Compute
[05/26/2021-17:30:49] [I] min: 0.139648 ms
[05/26/2021-17:30:49] [I] max: 4.61816 ms
[05/26/2021-17:30:49] [I] mean: 0.147407 ms
[05/26/2021-17:30:49] [I] median: 0.145508 ms
[05/26/2021-17:30:49] [I] percentile: 0.160156 ms at 99%
[05/26/2021-17:30:49] [I] total compute time: 2.29189 s
&&&& PASSED TensorRT.trtexec # trtexec --onnx=F_TorrentNet.onnx --verbose

fp16

[05/26/2021-17:28:57] [I] Host Latency
[05/26/2021-17:28:57] [I] min: 0.130615 ms (end to end 0.143433 ms)
[05/26/2021-17:28:57] [I] max: 3.4176 ms (end to end 3.42505 ms)
[05/26/2021-17:28:57] [I] mean: 0.136759 ms (end to end 0.151333 ms)
[05/26/2021-17:28:57] [I] median: 0.136841 ms (end to end 0.150391 ms)
[05/26/2021-17:28:57] [I] percentile: 0.149658 ms at 99% (end to end 0.162842 ms at 99%)
[05/26/2021-17:28:57] [I] throughput: 6425.03 qps
[05/26/2021-17:28:57] [I] walltime: 3.0003 s
[05/26/2021-17:28:57] [I] Enqueue Time
[05/26/2021-17:28:57] [I] min: 0.107452 ms
[05/26/2021-17:28:57] [I] max: 3.40491 ms
[05/26/2021-17:28:57] [I] median: 0.111328 ms
[05/26/2021-17:28:57] [I] GPU Compute
[05/26/2021-17:28:57] [I] min: 0.116943 ms
[05/26/2021-17:28:57] [I] max: 3.40259 ms
[05/26/2021-17:28:57] [I] mean: 0.122423 ms
[05/26/2021-17:28:57] [I] median: 0.123291 ms
[05/26/2021-17:28:57] [I] percentile: 0.132812 ms at 99%
[05/26/2021-17:28:57] [I] total compute time: 2.35996 s
&&&& PASSED TensorRT.trtexec # trtexec --onnx=F_TorrentNet.onnx --fp16 --verbose

Thank you.

But when I use the .trt engine to do inference, the FP16 .trt is slower than FP32 for some models.

Hi @530869411,

Could you please share the issue-reproducible inference script (including the time calculation) so that we can try it from our end.

Thank you.

@spolisetty Hi, the above inference-10.py is my inference code; you can download it to test. You can download F_Let.pth, F_Let_fp16.trt, and inference-10.py to test. What's more, I found that inference on LeNet is faster with FP16, but on the MLP it is slower.
F_MLP.pth (940.4 KB)
F_MLP_fp16.trt (484.2 KB)

Hi @530869411,

We tried running convert_to_onnx.py but are facing some errors. Please share only the ONNX model, so that we can generate the FP16 and FP32 engines ourselves and verify the performance to reproduce the issue.

For your information, the conversion needs to be executed on the machine on which inference will run. This is because TensorRT optimizes the graph for the available GPU, so the generated engine cannot be used on a different platform.

Thank you.


Thank you very much.
F_MLP.onnx (932.7 KB)
F_XMLP.onnx (4.1 MB)

@spolisetty

Hi @530869411,

Sorry for the delayed response. We tried converting the model to a TRT engine and running your inference script to test the performance, but we are facing many errors. Also, it looks like you are timing the loading of the PyTorch model and the TRT engine together with the inference itself. We recommend writing separate code that measures only the inference time on the TRT engine to get a correct measurement.

When we run with trtexec, we could not reproduce the issue (FP16 being slower than FP32).
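As an illustration of the separate-timing suggestion above, a minimal pycuda-based sketch that deserializes the engine once and times only the inference loop (the engine file name, iteration count, and single-input/single-output assumption are illustrative):

import time
import tensorrt as trt
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Engine deserialization happens once, outside the timed region.
runtime = trt.Runtime(TRT_LOGGER)
with open("F_MLP_fp16.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate page-locked host buffers and device buffers for every binding
# (assumes static shapes, one input at binding 0 and one output at the last binding).
host_bufs, dev_bufs, bindings = [], [], []
for binding in engine:
    shape = engine.get_binding_shape(binding)
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    h = cuda.pagelocked_empty(trt.volume(shape), dtype)
    d = cuda.mem_alloc(h.nbytes)
    host_bufs.append(h)
    dev_bufs.append(d)
    bindings.append(int(d))

stream = cuda.Stream()
n_iters = 1000

# Time only H2D copy + execution + D2H copy, not engine loading.
start = time.perf_counter()
for _ in range(n_iters):
    cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_bufs[-1], dev_bufs[-1], stream)
    stream.synchronize()
elapsed_ms = (time.perf_counter() - start) / n_iters * 1e3
print("mean latency: %.3f ms" % elapsed_ms)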

Thank you.

@spolisetty Thank you very much for helping me so many times. Now I am trying to do TensorRT QAT. Could you tell me where I can find some examples?

Hi @530869411,

Hope the following will help you:
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work-with-qat-networks
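(Once a QAT model has been exported to ONNX with Q/DQ nodes, the engine build itself follows the usual flow; a typical command would be something like the following, where the file name is a placeholder:)

trtexec --onnx=<qat_model>.onnx --int8 --verbose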

Thank you.

@spolisetty Does TensorRT support QAT? Or does TensorRT only have post-training quantization?

Hi @530869411,

It looks like we have the same query on a new thread. Please follow up here.

Thank you.