TensorRT running inference with batch size > 1

Description

Hi,
I’m having trouble running inference with batch size > 1.
I’m building the network from a ResNet-50 ONNX model and loading it into my C++ project. When running inference with batch_size = 1 everything is fine. When running inference with batch_size > 1 I get an empty output buffer for inference indices 1, 2, etc., although the inference result for index 0 is fine.

I’ve built the network with maximum batch of batch_size=5:
builder->setMaxBatchSize(batch_size);
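For context, the rest of the build step looks roughly like this (a minimal sketch; the network, config, and workspace size are assumptions not shown in the thread):
config->setMaxWorkspaceSize(1ULL << 28);  // scratch memory for tactic selection; 256 MB is only an example value
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);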
I’ve assigned input / output buffers for batch_size images:
for (size_t i = 0; i < engine->getNbBindings(); ++i)
{
    // Size of one binding for the whole batch: elements per sample * batch_size * sizeof(float)
    auto binding_size = getSizeByDim(engine->getBindingDimensions(i)) * batch_size * sizeof(float);
    cudaMalloc(&buffers[i], binding_size);
    if (engine->bindingIsInput(i))
    {
        input_dims.emplace_back(engine->getBindingDimensions(i));
    }
    else
    {
        output_dims.emplace_back(engine->getBindingDimensions(i));
    }
}
I’ve activated the enqueue API with batch_size of 5:
context->enqueue(batch_size, buffers.data(), localStream, nullptr);
I’m reading back the full batched output:
std::vector<float> cpu_output(getSizeByDim(dims) * batch_size);
cudaMemcpy(cpu_output.data(), gpu_output, cpu_output.size() * sizeof(float), cudaMemcpyDeviceToHost);
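For completeness, the matching input copy looks like this (a sketch, assuming cpu_input is a contiguous host buffer holding batch_size preprocessed images and buffers[0] is the input binding):
cudaMemcpy(buffers[0], cpu_input.data(), cpu_input.size() * sizeof(float), cudaMemcpyHostToDevice);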

I’ve read a few posts on the topic of running inference on several images at a time, but I couldn’t locate the issue in my code yet; any assistance would be appreciated.

imagenet_classes.txt (21.2 KB) SampleFlow.cpp (17.0 KB)

Environment

TensorRT Version: 7.2.1.6 (Windows10.x86_64, cuda-10.2, cudnn8.0)
GPU Type: Quadro M2000M
Nvidia Driver Version: 26.21.14.4122
CUDA Version: 10.2
CUDNN Version: 8.0.5.39 (cudnn-10.2-windows10-x64-v8.0.5.39)
Operating System + Version: Windows 10


Hi, we request you to share your model and script so that we can help you better.

Alternatively, you can try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
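For example, a rough invocation could look like this (the file name and input tensor name here are assumptions):
trtexec --onnx=resnet50.onnx --shapes=input:5x3x224x224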

Thanks!

Thanks for the fast reply. Attached is a minimal running example.

The model itself is too big to upload, but it’s a plain ResNet-50 generated using PyTorch. I’ve uploaded the generating script: Script.7z (338.2 KB).

Code + ONNX model now shared on

Hi @amit.katzi,

Could you please check the batch dimension of the ONNX input and make sure it is -1 when exporting to ONNX, i.e. set the batch dimension as a dynamic axis during export.
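On the C++ side, one way to confirm this after parsing (a sketch, assuming network is the INetworkDefinition produced by the ONNX parser and the batch axis is the first input dimension):
nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
// inputDims.d[0] should be -1 here; if it is still 1, the batch axis was not exported as dynamic.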

For your reference,

Thank you.

Thanks for the advice @spolisetty.

The ONNX input indeed had fixed dimensions 1x3x224x224. I recreated the ONNX with dynamic input & output, and the input now looks like -1x3x224x224.

After more fixes (like adding an optimization profile) I was able to run inference over 5 images using a single enqueue call.

I measured an improvement of 35% when switching from batch_size 1 to batch_size 5.
I measured a similar gain on the Quadro M2000M (FP32) and on the Xavier AGX (FP16).

I use the following optimization settings:
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(5, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(5, 3, 224, 224));
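For completeness, this is roughly how the rest of the explicit-batch path fits together (a minimal sketch; the config and context objects, the input binding index 0, and the use of enqueueV2 reflect my setup and may differ in yours):
// Build time: register the profile with the builder config before creating the engine.
config->addOptimizationProfile(profile);
// Run time: set the actual batch size for this inference, then use the explicit-batch enqueue.
context->setBindingDimensions(0, nvinfer1::Dims4(batch_size, 3, 224, 224));
context->enqueueV2(buffers.data(), localStream, nullptr);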

Checking the profiler on the Quadro M2000M shows a kernel efficiency of 25%, and it does not increase when going from batch_size 1 to batch_size 5.

Can you offer some advice on how to get better throughput at batch_size 5?

Hi @amit.katzi,

Could you please let us know what “kernel efficiency” refers to and which tool you used to calculate this metric?

Thank you.

Hi @spolisetty,

I’m using the Nsight Systems 2019.5.2 tool for profiling.
The metric I’m referring to is the “Theoretical Occupancy” that Nsight displays for the different kernels used when the network runs. All DNN kernels show the same 25% theoretical occupancy (running on the Quadro M2000M).
When running on the Xavier AGX, run time is halved compared to the Quadro M2000M due to the use of FP16, so I estimate the occupancy is not higher there either.

Hi @amit.katzi,

This is a known nsys issue: the CUDA Occupancy Calculator shows 25%.
I am not sure whether your “Theoretical Occupancy” reading is affected by it as well.
Please check GPU utilization using nvidia-smi.
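For example, something along these lines while the inference loop is running (the query fields are just a suggestion):
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1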

Thank you.