Converted model is broken with half precision, dynamic batch size, and batch size greater than 1


Converting an FP16 ViT model from ONNX to TensorRT with batch size > 1 produces wrong inference output. The conversion succeeds, but the output is roughly the same for every image and unrelated to the input.
For example, this is the output of the broken converted model for 4 images:

[[ 2.671875   -1.2802734  -0.42578125]]
[[ 2.6484375  -1.2753906  -0.43652344]]
[[ 2.6328125  -1.2939453  -0.40966797]]
[[ 2.6425781  -1.2939453  -0.42871094]]

The correct model output should be:

[[-0.18860717  3.2261243   0.13521218]]
[[-0.9554258  3.263251   0.7200781]]
[[-0.5815705   3.6151268  -0.17602095]]
[[-1.1649361   3.341591   -0.16794899]]
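
The difference between the two listings can be checked mechanically. Below is a minimal sketch of such a check. The helper names `column_spreads` and `outputs_collapsed` and the 0.1 threshold are assumptions made for illustration, not part of TensorRT or Triton. It flags a batch whose logits barely change from image to image, which is the symptom shown above.

```python
def column_spreads(batch):
    """Return max - min of each output column across the batch."""
    return [max(col) - min(col) for col in zip(*batch)]


def outputs_collapsed(batch, tol=0.1):
    """True if every output column varies by less than `tol` across the batch.

    `tol` is an assumed threshold; tune it for your model's logit scale.
    """
    return all(spread < tol for spread in column_spreads(batch))


# Values copied from the post, one row per image.
broken = [
    [2.671875, -1.2802734, -0.42578125],
    [2.6484375, -1.2753906, -0.43652344],
    [2.6328125, -1.2939453, -0.40966797],
    [2.6425781, -1.2939453, -0.42871094],
]
correct = [
    [-0.18860717, 3.2261243, 0.13521218],
    [-0.9554258, 3.263251, 0.7200781],
    [-0.5815705, 3.6151268, -0.17602095],
    [-1.1649361, 3.341591, -0.16794899],
]

print(outputs_collapsed(broken))   # True
print(outputs_collapsed(correct))  # False
```

The check is only meaningful when the input images really differ. Identical inputs or zero-filled warmup data legitimately give near-identical outputs.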

This issue only occurs

  • If the model is FP16
  • If a dynamic batch size is used when converting to TensorRT. This applies both to trtexec conversion and to the tritonserver tensorrt gpu_execution_accelerator with warmup batch_size>1 or a first request with batch_size>1. With warmup batch_size=1 and a single image per request, it runs perfectly fine.

The same ONNX model runs fine in a local inference script or deployed with tritonserver, regardless of batch size and precision. I also ran the same experiments with a yolov7 segmentation model, and both the fp16 and fp32 models work properly with dynamic batch size.

Any idea why?
Any help would be appreciated. I want to deploy my model in half precision because the speed difference is huge:

onnx fp32 with tensorrt optimizations 
bs=1 55fps

onnx fp16 without tensorrt optimizations
bs=1 108fps

onnx fp16 with tensorrt optimizations 
bs=1 140fps
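
To put the figures above in relative terms, here is a small sketch that computes each configuration's speedup over the FP32 run. The numbers are the ones from the post; the dictionary keys are shorthand labels chosen for this example.

```python
# Throughput in frames per second at batch size 1, as reported in the post.
fps = {
    "onnx fp32 + tensorrt": 55,
    "onnx fp16": 108,
    "onnx fp16 + tensorrt": 140,
}

baseline = fps["onnx fp32 + tensorrt"]
speedups = {name: value / baseline for name, value in fps.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# onnx fp32 + tensorrt: 1.00x
# onnx fp16: 1.96x
# onnx fp16 + tensorrt: 2.55x
```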


TensorRT Version:
GPU Type: 2080ti, 3080
Nvidia Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.7.0
Operating System + Version: Fedora Linux 37.20230223.0 (Silverblue)
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.13.0
Baremetal or Container (if container which image + tag):

Relevant Files

Steps To Reproduce

  1. I have a ViT model that I converted to ONNX and then TensorRT with:
import torch
from timm import create_model

torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

# load model
model = create_model(
	...  # the arguments (and any setup lines that followed) are missing from the post
)

# dummy FP16 input on the GPU; the batch dimension is made dynamic below
x = torch.randn(1, 3, 224, 224, requires_grad=True).cuda().half()
torch_out = model(x)

# export the model to ONNX
torch.onnx.export(
	model,  # model being run
	x,  # model input (or a tuple for multiple inputs)
	"vit_base_patch16_224_half.onnx",  # where to save the model (can be a file or file-like object)
	export_params=True,  # store the trained parameter weights inside the model file
	opset_version=10,  # the ONNX version to export the model to
	do_constant_folding=True,  # whether to execute constant folding for optimization
	input_names=['input'],  # the model's input names
	output_names=['output'],  # the model's output names
	dynamic_axes={'input': {0: 'batch_size'},  # variable length axes
	              'output': {0: 'batch_size'}}
)

I’m using a custom fine-tuned model, but the same applies to the pre-trained models provided by pytorch-image-models.

  2. Convert it to TensorRT with:
trtexec --onnx=vit_base_patch16_224_half.onnx --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224  --maxShapes=input:32x3x224x224 --fp16 --saveEngine=vit_base_patch16_224_half.trt
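
Since the static batch_size=1 build works and the dynamic one does not, it can help to generate both command lines from one place when comparing engines. The following is a minimal sketch under that assumption. `trtexec_command` is a hypothetical helper written for this example; it only uses the trtexec flags that appear in the command above.

```python
import shlex


def trtexec_command(onnx_path, engine_path, input_name, chw,
                    min_bs, opt_bs, max_bs, fp16=True):
    """Assemble a trtexec argument list for a dynamic batch dimension."""
    if not (min_bs <= opt_bs <= max_bs):
        raise ValueError("batch sizes must satisfy min <= opt <= max")

    def shape(batch):
        # e.g. "input:8x3x224x224"
        dims = "x".join(str(d) for d in (batch, *chw))
        return f"{input_name}:{dims}"

    args = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--minShapes={shape(min_bs)}",
        f"--optShapes={shape(opt_bs)}",
        f"--maxShapes={shape(max_bs)}",
    ]
    if fp16:
        args.append("--fp16")
    args.append(f"--saveEngine={engine_path}")
    return args


# Reproduces the dynamic-batch command from the post.
cmd = trtexec_command(
    "vit_base_patch16_224_half.onnx",
    "vit_base_patch16_224_half.trt",
    input_name="input",
    chw=(3, 224, 224),
    min_bs=1, opt_bs=8, max_bs=32,
)
print(shlex.join(cmd))
```

Passing `min_bs=opt_bs=max_bs=1` gives the static single-image variant, and `fp16=False` drops the `--fp16` flag for an FP32 comparison build.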

When I use this engine, both locally with an inference script that loads the TensorRT engine and deployed with tritonserver, it always gives outputs like these for every image.
The output for 4 images:

[[ 2.671875   -1.2802734  -0.42578125]]
[[ 2.6484375  -1.2753906  -0.43652344]]
[[ 2.6328125  -1.2939453  -0.40966797]]
[[ 2.6425781  -1.2939453  -0.42871094]]

The correct model output should be:

[[-0.18860717  3.2261243   0.13521218]]
[[-0.9554258  3.263251   0.7200781]]
[[-0.5815705   3.6151268  -0.17602095]]
[[-1.1649361   3.341591   -0.16794899]]
  3. I also deployed the exact same ONNX model with tritonserver, with and without gpu_execution_accelerator. Here is the tritonserver config:
name: "vitaction"
platform: "onnxruntime_onnx"
default_model_filename: "vit_base_patch16_224_half.onnx"
max_batch_size : 32
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP16
    dims: [ 3 ]
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    parameters { key: "trt_engine_cache_enable" value: "true" }
    parameters { key: "trt_engine_cache_path" value: "/var/cache/trt_cache" }
  } ]
} }
model_warmup [{
  name: "warmup"
  batch_size: 1 # I'm changing this
  inputs: [{
    key: "input"
    value: {
      data_type: TYPE_FP16
      dims: [3, 224, 224]
      zero_data: true
    }
  }]
}]

If I change batch_size in model_warmup to a value greater than 1, or send a request with more than one image, the inference output is the same broken output shown above. If I convert the model with batch_size=1 with trtexec, or warm up with batch_size=1, the model generates proper outputs.
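
Because single-image requests behave correctly, one stopgap is to feed the engine one image at a time, at the cost of batching throughput. This is a minimal sketch of that idea only. `infer_batch` is a hypothetical callable standing in for whatever wraps your Triton client or TensorRT runner, and `fake_backend` exists purely to make the example runnable.

```python
def infer_one_at_a_time(images, infer_batch):
    """Call `infer_batch` with single-image batches; return one output per image.

    `infer_batch` is a hypothetical callable that takes a list of images and
    returns a list with one output per image.
    """
    results = []
    for image in images:
        outputs = infer_batch([image])  # batch size is always 1
        results.append(outputs[0])
    return results


# Stand-in backend for demonstration only: records the batch sizes it receives.
seen_batch_sizes = []


def fake_backend(batch):
    seen_batch_sizes.append(len(batch))
    return [sum(image) for image in batch]


demo = infer_one_at_a_time([[1, 2], [3, 4], [5, 6]], fake_backend)
print(demo)              # [3, 7, 11]
print(seen_batch_sizes)  # [1, 1, 1]
```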

This issue only occurs if the model is half precision and batch_size is greater than 1. If I use the exact same script to convert an fp32 model, with every config and command exactly the same except for the fp16 flags, the model produces proper inference output regardless of the batch size.

I’m sure that this issue is not related to the ONNX conversion step, because I can use the ONNX model with a local script, and I can deploy it with tritonserver using the exact same config minus gpu_execution_accelerator and model_warmup. Both the local runs and the tritonserver responses are perfectly fine.
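
The comparison described here, with the ONNX output as reference and the TensorRT output as candidate, can be sketched as an element-wise tolerance check. The helper names and the `atol=0.05` tolerance are assumptions for illustration; a suitable FP16 tolerance depends on the model.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equally shaped batches."""
    if len(reference) != len(candidate):
        raise ValueError("batches differ in size")
    worst = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("rows differ in length")
        for r, c in zip(ref_row, cand_row):
            worst = max(worst, abs(r - c))
    return worst


def matches_reference(reference, candidate, atol=0.05):
    """`atol` is an assumed FP16 tolerance, not a documented threshold."""
    return max_abs_diff(reference, candidate) <= atol


# Values from the post: ONNX (reference) versus the FP16 TensorRT engine.
onnx_out = [
    [-0.18860717, 3.2261243, 0.13521218],
    [-0.9554258, 3.263251, 0.7200781],
    [-0.5815705, 3.6151268, -0.17602095],
    [-1.1649361, 3.341591, -0.16794899],
]
trt_out = [
    [2.671875, -1.2802734, -0.42578125],
    [2.6484375, -1.2753906, -0.43652344],
    [2.6328125, -1.2939453, -0.40966797],
    [2.6425781, -1.2939453, -0.42871094],
]

print(matches_reference(onnx_out, trt_out))  # False
```

A gap of several logit units, as here, is far beyond ordinary FP16 rounding error and points at the engine build instead.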

I also tried another model, yolov7. The problem doesn’t occur for the yolov7 model at either precision.

Can you try running your model with the trtexec command and share the --verbose log if the issue persists?

You can refer to the link below for the list of all supported operators. If any operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:


Thanks for the quick reply. I’ve added verbose log and my model.

I’ve checked the operators, and as far as I can see every operator is supported for both fp32 and fp16. As I’ve said, converting the same model in FP32 mode works.

Here is the verbose log:
trtexec_fp16.log (1.7 MB)

Here is my model: model_best.pth.tar - Google Drive



Any chance of someone looking into this?


@NVES , @cagdas
Have you solved the issue?
I have faced the same problem with a ViT model and could not convert it to TensorRT FP16 properly…

Hi @nick_93,

Unfortunately, no. I disabled the TensorRT optimizations for the time being. If I find a fix, I’ll certainly share it here.

Bumping up the thread again @NVES . Any help would be appreciated.

Hi @cagdas ,
Apologies for the delayed response. Can you please try the latest TRT version and confirm whether the issue persists?

@AakankshaS No worries about the delay. The issue got fixed. I tried out the latest TRT version like you suggested, and it’s running smooth now. Thanks for the help!

I have the same problem with TensorRT 8.6 and CUDA 12.0. Can you tell me which version fixed it for you? And are you still using the TensorRT optimizations after changing the version? I’d appreciate your help.