Inference time in FP16 and FP32 is the same

I am using a TX2 NX to build and run a TensorRT engine. I have exported my ONNX model and am converting it into a .trt engine file. The model is a basic MobileNetV2.

The commands to build the FP16 and FP32 models:

FP16

trtexec --onnx=onnx_model.onnx --saveEngine=TRTBS1.trt --explicitBatch --fp16

The verbose output during the engine build showed:

[I] Precision: FP32 + FP16

FP32

trtexec --onnx=onnx_model.onnx --saveEngine=TRTBS1.trt --explicitBatch

The verbose output during the engine build showed:

[I] Precision: FP32

The inference time for both models is exactly the same, which basically means both engines are effectively FP32. Am I right?
If so, how do I improve performance further by moving the model to FP16 precision? I have a little accuracy to spare, but not much compute.
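
In case it helps, here is roughly how I understand the equivalent build through the TensorRT Python API (just a sketch based on the TensorRT 8.0 docs, not something I have verified on the TX2 NX; the FP16 flag is the part I am asking about):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# Parse the exported MobileNetV2 ONNX file
with open("onnx_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30              # 1 GB workspace (TRT 8.0 API)
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 kernels, like --fp16

engine = builder.build_engine(network, config)   # deprecated in later releases, valid in 8.0
with open("TRTBS1.trt", "wb") as f:
    f.write(engine.serialize())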

Hi,
Please refer to the links below for the custom plugin implementation and sample:

While the IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively, we recommend that you write new plugins or refactor existing ones to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.

Thanks!

Hey @NVES, I do not understand why I would want to alter the ONNX graph, since I have no custom layers; it is just the standard MobileNetV2.
I need to know whether the .trt engine can be made faster by moving to FP16 precision instead of FP32. If so, what would the process be?

Hi,

Which version of TensorRT are you using?
This can happen if many layers end up falling back to FP32: TensorRT automatically chooses the fastest kernel among the allowed precisions.
Please check the verbose logs, and share the ONNX model and verbose logs with us.
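
For example, a rough way to count how many layers ended up in each precision is to scan the verbose log for the tensor formats printed in the engine layer information (this is only a heuristic; it assumes those lines contain "Layer(" together with Half/Float format strings, so adjust the keywords to whatever your log actually prints):

from collections import Counter

counts = Counter()
with open("log.txt") as f:            # trtexec --verbose output
    for line in f:
        if "Layer(" in line:          # engine layer information lines (assumed format)
            if "Half" in line:
                counts["fp16"] += 1
            elif "Float" in line:
                counts["fp32"] += 1

print(counts)  # many fp32 entries in an --fp16 build indicate fallback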

Thank you.

TensorRT 8.0.1.6.
Sure, I will share the ONNX model and the verbose log.


Here are the ONNX model and the verbose log:
log.txt (1.0 MB)
onnx_model.onnx (8.5 MB)

Hi,

On our side, FP16 runs faster than FP32.

FP16 (internally TRT uses FP16+FP32) logs:

=== Performance summary ===
[07/21/2022-14:10:40] [I] Throughput: 2251.24 qps
[07/21/2022-14:10:40] [I] Latency: min = 0.530518 ms, max = 1.86304 ms, mean = 0.567591 ms, median = 0.565918 ms, percentile(99%) = 0.594482 ms
[07/21/2022-14:10:40] [I] End-to-End Host Latency: min = 0.569336 ms, max = 1.93237 ms, mean = 0.788129 ms, median = 0.821899 ms, percentile(99%) = 0.861084 ms

FP32 logs:

=== Performance summary ===
[07/21/2022-14:54:35] [I] Throughput: 948.788 qps
[07/21/2022-14:54:35] [I] Latency: min = 1.1499 ms, max = 1.37012 ms, mean = 1.17979 ms, median = 1.17896 ms, percentile(99%) = 1.2085 ms
[07/21/2022-14:54:35] [I] End-to-End Host Latency: min = 1.17493 ms, max = 2.19727 ms, mean = 1.95792 ms, median = 1.9646 ms, percentile(99%) = 2.04865 ms

As you can see, the latency with FP16 is better than with FP32.

Thank you.

Hi, this is strange, because I get much higher times, around 7-9 ms. Are you sure this was run on the TX2 NX?

@spolisetty, can you share the trtexec build commands so I can verify?

Hi,

We have not verified this on the TX2 NX. Please try the following commands on the TX2 NX.
If you still face this issue, we would like to move this post to the TX2 NX forum so you can get better help.

FP16
/opt/tensorrt/bin/trtexec --onnx=onnx_model.onnx --verbose --workspace=5000 --fp16

FP32
/opt/tensorrt/bin/trtexec --onnx=onnx_model.onnx --verbose --workspace=5000

Thank you.

Sure.

FP16

[07/22/2022-14:45:38] [I] === Performance summary ===
[07/22/2022-14:45:38] [I] Throughput: 175.887 qps
[07/22/2022-14:45:38] [I] Latency: min = 5.52954 ms, max = 7.27454 ms, mean = 5.6736 ms, median = 5.65527 ms, percentile(99%) = 6.5177 ms

FP32

[07/22/2022-14:53:46] [I] === Performance summary ===
[07/22/2022-14:53:46] [I] Throughput: 164.369 qps
[07/22/2022-14:53:46] [I] Latency: min = 5.95654 ms, max = 8.1095 ms, mean = 6.07347 ms, median = 6.0498 ms, percentile(99%) = 6.71985 ms

Is this all there is to gain?

Hi,

We do see some improvement with FP16. For further gains, you can try increasing the workspace and also INT8 precision.
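
For example, before spending time on an INT8 build, you can check what the builder reports for the platform (TensorRT Python API; note that a real INT8 build also needs a calibrator or a QAT model, which is not shown here):

import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
print("fast FP16 support:", builder.platform_has_fast_fp16)
print("fast INT8 support:", builder.platform_has_fast_int8)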

Thank you.

Sure, I will try and let you know.

Hey @spolisetty, increasing the workspace has not made any gain; this might be because we had already allocated more than the maximum needed. I believe the TX2 NX GPU does not support INT8, which is why there was no time improvement when I ran the build with INT8.

Hi,

As you said, TX2 doesn’t support INT8 operation.
Only FP32 and FP16 are available.

Have you maximized device performance before profiling?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Also, we recommend upgrading to TensorRT 8.2, which is included in JetPack 4.6.2.
Thanks.

Hey @AastaLLL, yes, the TX2 is running at max power and jetson_clocks was enabled. Unfortunately, the JetPack 4.6.2 BSP is not available from the carrier board manufacturer. Is there an alternative?

Hi,

Does JetPack 4.6.1 work for you?
Or do you need JetPack 4.6 for the device?

Thanks.

The latest one available, and the one I am running, is 4.6.

Please contact the board vendor and ask them to update to a newer JetPack. Thanks.