Converted model is broken with half precision, dynamic batch size, and batch size greater than 1

Description

Converting an FP16 ViT model from ONNX to TensorRT with batch size > 1 results in broken inference output. The conversion succeeds, but the output is essentially the same for every input image and does not correspond to the actual inputs.
For example, the broken converted model produces the following output for 4 images:

[[ 2.671875   -1.2802734  -0.42578125]]
[[ 2.6484375  -1.2753906  -0.43652344]]
[[ 2.6328125  -1.2939453  -0.40966797]]
[[ 2.6425781  -1.2939453  -0.42871094]]

The correct model output should be:

[[-0.18860717  3.2261243   0.13521218]]
[[-0.9554258  3.263251   0.7200781]]
[[-0.5815705   3.6151268  -0.17602095]]
[[-1.1649361   3.341591   -0.16794899]]

This issue only occurs:

  • If the model is FP16
  • If a dynamic batch size is applied when converting to TensorRT. This applies both to trtexec conversion and to the tritonserver tensorrt gpu_execution_accelerator with warmup batch_size > 1 or a first request with batch_size > 1. With warmup batch_size = 1 and a single image per request, everything runs perfectly fine.

The same ONNX model runs fine in a local inference script or deployed with tritonserver, regardless of the batch size and precision. I also tried the same experiments with a YOLOv7 segmentation model, and both the FP16 and FP32 models with dynamic batch size work properly.

Any idea why?
Any help would be appreciated. I want to deploy my model in half precision because the speed difference is huge:

  • ONNX FP32 with TensorRT optimizations, bs=1: 55 fps
  • ONNX FP16 without TensorRT optimizations, bs=1: 108 fps
  • ONNX FP16 with TensorRT optimizations, bs=1: 140 fps
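
Roughly how these bs=1 throughput numbers can be measured locally with onnxruntime-gpu, toggling the TensorrtExecutionProvider on and off (a minimal sketch; the provider list and iteration counts are assumptions, not my exact benchmark script):

# Rough local throughput check with onnxruntime-gpu. Dropping
# TensorrtExecutionProvider from the list gives the "without tensorrt
# optimizations" case. Iteration counts are arbitrary.
import time
import numpy as np
import onnxruntime as ort

providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider"]
sess = ort.InferenceSession("vit_base_patch16_224_half.onnx", providers=providers)

x = np.random.rand(1, 3, 224, 224).astype(np.float16)
for _ in range(20):  # warmup
    sess.run(None, {"input": x})

n = 200
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {"input": x})
print(f"{n / (time.perf_counter() - start):.1f} fps")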

Environment

TensorRT Version: 8.5.2.2
GPU Type: 2080ti, 3080
Nvidia Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.7.0
Operating System + Version: Fedora Linux 37.20230223.0 (Silverblue)
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.13.0
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:23.01-py3, nvcr.io/nvidia/tritonserver:23.01-py3

Relevant Files

Steps To Reproduce

  1. I have a ViT model that I converted to ONNX and then to TensorRT with:
import torch
from timm import create_model

torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

# load model
model = create_model(
	"vit_base_patch16_224",
	num_classes=3,
	in_chans=3,
	pretrained=True
)

model.cuda()
model.eval()
model.half()

x = torch.randn(1, 3, 224, 224, requires_grad=True).cuda().half()
torch_out = model(x)

torch.onnx.export(
	model,  # model being run
	x,  # model input (or a tuple for multiple inputs)
	"vit_base_patch16_224_half.onnx",  # where to save the model (can be a file or file-like object)
	export_params=True,  # store the trained parameter weights inside the model file
	opset_version=10,  # the ONNX version to export the model to
	do_constant_folding=True,  # whether to execute constant folding for optimization
	input_names=['input'],  # the model's input names
	output_names=['output'],  # the model's output names
	dynamic_axes={'input': {0: 'batch_size'},  # variable length axes
	              'output': {0: 'batch_size'}}
)
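
As a sanity check that the exported ONNX file itself handles batch > 1, the PyTorch output can be compared against onnxruntime-gpu. Below is a minimal sketch that reuses `model` from the export script above; the batch size is an arbitrary choice:

# Sanity check: compare PyTorch and ONNX Runtime outputs for a batch > 1.
# Reuses `model` from the export script above; batch size is arbitrary.
import numpy as np
import onnxruntime as ort
import torch

batch = torch.randn(4, 3, 224, 224).cuda().half()
with torch.no_grad():
    torch_out = model(batch).float().cpu().numpy()

sess = ort.InferenceSession(
    "vit_base_patch16_224_half.onnx",
    providers=["CUDAExecutionProvider"],
)
ort_out = sess.run(None, {"input": batch.cpu().numpy()})[0].astype(np.float32)

print(np.abs(torch_out - ort_out).max())  # small difference expected for fp16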

I’m using a custom fine-tuned model, but the same applies to the pre-trained model provided by pytorch-image-models.

  2. Convert it to TensorRT with:
trtexec --onnx=vit_base_patch16_224_half.onnx --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224  --maxShapes=input:32x3x224x224 --fp16 --saveEngine=vit_base_patch16_224_half.trt

When I use this engine, both locally with an inference script that loads the TensorRT engine (a minimal sketch of such a script follows the outputs below) and deployed with tritonserver, it always gives outputs like these for every image. The output below is for 4 images:

[[ 2.671875   -1.2802734  -0.42578125]]
[[ 2.6484375  -1.2753906  -0.43652344]]
[[ 2.6328125  -1.2939453  -0.40966797]]
[[ 2.6425781  -1.2939453  -0.42871094]]

The correct model output should be:

[[-0.18860717  3.2261243   0.13521218]]
[[-0.9554258  3.263251   0.7200781]]
[[-0.5815705   3.6151268  -0.17602095]]
[[-1.1649361   3.341591   -0.16794899]]
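
The local inference script mentioned above looks roughly like this; a minimal sketch using the TensorRT Python API and pycuda, not my exact script (image loading and preprocessing are omitted, and the zero batch is a placeholder):

# Minimal sketch of a local inference script for the built engine.
# Preprocessing is omitted; the zero batch is a placeholder for real images.
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("vit_base_patch16_224_half.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch = np.zeros((4, 3, 224, 224), dtype=np.float16)  # replace with real images
context.set_binding_shape(0, batch.shape)             # set the dynamic batch dim

output = np.empty((batch.shape[0], 3), dtype=np.float16)
d_input = cuda.mem_alloc(batch.nbytes)
d_output = cuda.mem_alloc(output.nbytes)

cuda.memcpy_htod(d_input, np.ascontiguousarray(batch))
context.execute_v2([int(d_input), int(d_output)])
cuda.memcpy_dtoh(output, d_output)
print(output)
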
  3. I also deployed the exact same model with tritonserver with/without gpu_execution_accelerator. Here is the tritonserver config:
name: "vitaction"
platform: "onnxruntime_onnx"
default_model_filename: "vit_base_patch16_224_half.onnx"
max_batch_size : 32
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP16
    dims: [ 3 ]
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    parameters { key: "trt_engine_cache_enable" value: "true" }
    parameters { key: "trt_engine_cache_path" value: "/var/cache/trt_cache" }
    }
  ]}
}
model_warmup [{
  name: "warmup"
  batch_size: 1 # I'm changing this
  inputs: [{
    key: "input"
    value: {
      data_type: TYPE_FP16
      dims: [3, 224, 224]
      zero_data: true
    }
  }]
}]

If I change the batch_size in model_warmup to a value greater than 1, or send a request with more than one image, the inference output is broken in the same way. If I convert the model with batch_size=1 with trtexec, or warm up with batch_size=1, the model generates proper outputs. A sketch of the kind of batched request I'm sending is below.
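
The batched request that triggers the broken output looks roughly like this (a minimal sketch with random data and an assumed localhost URL, not my actual client code):

# Batched request (batch_size > 1) that triggers the broken output.
# Random data and the localhost URL are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.random.rand(4, 3, 224, 224).astype(np.float16)

inp = httpclient.InferInput("input", list(batch.shape), "FP16")
inp.set_data_from_numpy(batch)

result = client.infer("vitaction", inputs=[inp])
print(result.as_numpy("output"))  # rows come back nearly identical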

This issue only occurs if the model is half precision and batch_size is larger than 1. If I use the exact same script to convert an FP32 model and keep every config/command exactly the same except the FP16 flags, the model produces proper inference output regardless of the batch size.

I’m sure this issue is not related to the ONNX conversion step, because I can use the ONNX model with a local script, and I can deploy the ONNX model with tritonserver using the exact same config minus gpu_execution_accelerator and model_warmup. Both the local runs and the tritonserver responses are perfectly fine.

I also tried another model, YOLOv7. The problem doesn’t occur for the YOLOv7 model with either precision.

Hi,
Can you try running your model with the trtexec command, and share the "--verbose" log if the issue persists?

You can refer to the link below for the list of supported operators. In case any operator is not supported, you need to create a custom plugin to support that operation.

Also, we request you to share your model and script, if not already shared, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:

Thanks!

Hi,
Thanks for the quick reply. I’ve added the verbose log and my model.

I’ve checked the operators, and every operator is supported for both FP32 and FP16 as far as I can see. As I’ve said, converting the same model in FP32 mode works.

Here is the verbose log:
trtexec_fp16.log (1.7 MB)

Here is my model: model_best.pth.tar - Google Drive

Thanks.

Hi @NVES,

Any chance looking into this?

Thanks!

@NVES , @cagdas
Have you solved the issue?
I have faced the same problem with a ViT model and could not convert it to TensorRT FP16 properly…

Hi @nick_93,

Unfortunately, no. I disabled TensorRT optimizations for the time being. If I find a solution, I’ll certainly share it here.


Bumping the thread again, @NVES. Any help would be appreciated.

Hi @cagdas ,
Apologies for the delayed response. Can you please try the latest TRT version and confirm whether the issue persists?

@AakankshaS No worries about the delay. The issue got fixed. I tried the latest TRT version like you suggested, and it’s running smoothly now. Thanks for the help!

I have the same problem. My TensorRT version is 8.6 and CUDA is 12.0. Can you tell me which version fixed it for you? And are the TensorRT optimizations still working after changing the version? I’d appreciate your help.