Description
Hi,
I’m having trouble running inference with batch size > 1.
I’m building the network from a ResNet-50 ONNX model and loading it into my C++ project. When running inference with batch_size=1 everything is fine. When running inference with batch_size > 1 I get an empty output buffer for inference indices 1, 2, etc., although the output for index 0 is fine.
I’ve built the network with a maximum batch size of batch_size=5:
builder->setMaxBatchSize(batch_size);
I’ve assigned input / output buffers for batch_size images:
for (size_t i = 0; i < engine->getNbBindings(); ++i)
{
    auto binding_size = getSizeByDim(engine->getBindingDimensions(i)) * batch_size * sizeof(float);
    cudaMalloc(&buffers[i], binding_size);
    if (engine->bindingIsInput(i))
    {
        input_dims.emplace_back(engine->getBindingDimensions(i));
    }
    else
    {
        output_dims.emplace_back(engine->getBindingDimensions(i));
    }
}
I’ve called the enqueue API with a batch_size of 5:
context->enqueue(batch_size, buffers.data(), localStream, nullptr);
I’m reading enough of the output results (note the buffer is sized for the full batch):
std::vector<float> cpu_output(getSizeByDim(dims) * batch_size);
cudaMemcpy(cpu_output.data(), gpu_output, cpu_output.size() * sizeof(float), cudaMemcpyDeviceToHost);
I’ve read a few posts about running inference on several images at a time, but I couldn’t locate the issue in my code yet; assistance will be appreciated.
imagenet_classes.txt (21.2 KB) SampleFlow.cpp (17.0 KB)
Environment
Windows 10
TensorRT Version : 7.2.1.6.Windows10.x86_64.cuda-10.2.cudnn8.0
GPU Type : QUADRO M2000M
Nvidia Driver Version : 26.21.14.4122
CUDA Version : 10.2
CUDNN Version : cudnn-10.2-windows10-x64-v8.0.5.39
NVES
January 10, 2021, 12:07pm
2
Hi, please share your model and script so that we can help you better.
Alternatively, you can try running your model with trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
Thanks!
Thanks for the fast reply. Attached is a minimal running example.
The model itself is too big to upload, but it’s a plain ResNet-50 generated using PyTorch. I’ve uploaded the generating script. [Script.7z|attachment]
(upload://uR3qfN1hHB0pGBYKk8EuyVOF1WW.7z) (338.2 KB)
Code + ONNX model now shared on
Hi @amit.katzi ,
Could you please check the batch dimension in the ONNX input? Make sure it is -1, i.e. set the batch dimension as a dynamic axis when exporting to ONNX.
For your reference,
opened 02:40AM - 17 Mar 20 UTC
closed 05:54AM - 30 Sep 20 UTC
Component: Caffe
Release: 7.x
## Description
I used TensorRT 5 successfully in my program before. Recently I updated from TensorRT 5 to TensorRT 7, still doing single-image inference. When I change one input to a batch input, I get incorrect inference results: whether I set the batch size to 2 or another number, only one inference result is correct; the other results are always zeros.
I set the batch size during serialization with builder->setMaxBatchSize(mMaxBatchSize) and config->setMaxWorkspaceSize(10 << 20). How can I resolve this?
## Environment
C++ interface, caffe model parse
**TensorRT Version**: tensorrt7
**GPU Type**: GeForce GTX 1080TI
**Nvidia Driver Version**: 410.93
**CUDA Version**: cuda 10.0
**CUDNN Version**: cudnn 7.6.5
**Operating System + Version**: 16.04
**Python Version (if applicable)**:
**TensorFlow Version (if applicable)**:
**PyTorch Version (if applicable)**:
**Baremetal or Container (if container which image + tag)**:
fetch infer result mOutputIdx[i]:1
[raw output dump trimmed: the first output contains normal-looking logit values, followed by long runs of zeros for the remaining batch entries]
Thank you.
Thanks for the advice @spolisetty .
The ONNX indeed had input dims 1x3x224x224. I recreated the ONNX with dynamic input & output, and the input now looks like -1x3x224x224.
After more fixes (like adding an optimization profile) I was able to run inference over 5 images using a single ‘enqueue’ call.
I measured a 35% throughput improvement switching from batch_size 1 to batch_size 5.
I measured a similar gain on both the Quadro M2000M (FP32) and Xavier AGX (FP16).
I use the following optimization settings:
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(5, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(5, 3, 224, 224));
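The "more fixes" mentioned above can be outlined roughly as follows, assuming the TensorRT 7 C++ API: with an explicit-batch network, the runtime batch size is supplied through the execution context (via setBindingDimensions) rather than through the first argument of enqueue, and enqueueV2 is used instead. This is only a sketch of the flow, not my exact code, and it omits parsing and error handling:

```cpp
// Build time: explicit-batch network + optimization profile.
auto builder = nvinfer1::createInferBuilder(gLogger);
auto network = builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
// ... parse the ONNX model into `network` with nvonnxparser ...

auto config  = builder->createBuilderConfig();
auto profile = builder->createOptimizationProfile();
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(5, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(5, 3, 224, 224));
config->addOptimizationProfile(profile);
auto engine = builder->buildEngineWithConfig(*network, *config);

// Run time: bind the actual batch size on the context, then enqueueV2,
// which takes the batch from the bound dimensions (no batch argument).
auto context = engine->createExecutionContext();
context->setBindingDimensions(0, nvinfer1::Dims4(batch_size, 3, 224, 224));
context->enqueueV2(buffers.data(), localStream, nullptr);
```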
Checking the profiler on the Quadro M2000M shows kernel efficiency of 25%, and it did not increase when going from batch_size 1 to batch_size 5.
Can you offer some advice on how to get better throughput at batch_size 5?
Hi @amit.katzi ,
Could you please let us know what "kernel efficiency" is, and which tool you used to calculate this metric?
Thank you.
Hi @spolisetty ,
I’m using the Nsight Systems 2019.5.2 tool for profiling.
The metric I’m referring to is the ‘Theoretical Occupancy’ Nsight displays for the different kernels used when the network runs. All DNN kernels display the same 25% theoretical occupancy (running on the Quadro M2000M).
When running on Xavier AGX, run time is halved compared to the Quadro M2000M due to using FP16, so I estimate the occupancy is not higher there.
Hi @amit.katzi ,
This is a known nsys issue: the CUDA Occupancy Calculator shows 25%.
I am not sure whether your ‘Theoretical Occupancy’ reading is also wrong.
Please check GPU utilization using nvidia-smi.
Thank you.