TensorRT running inference with batch size > 1

Description

Hi,
I’m having trouble running inference with batch size > 1.
I’m building the network from a ResNet-50 ONNX model and loading it into my C++ project. When running inference with batch_size = 1 everything is fine. When running inference with batch_size > 1 I get an empty output buffer for inference indices 1, 2, etc., although the inference result for index 0 is fine.

I’ve built the network with maximum batch of batch_size=5:
builder->setMaxBatchSize(batch_size);
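For context, the rest of the build step looks roughly like this (a minimal sketch; the network, config, and workspace size are assumptions not shown in the thread):
config->setMaxWorkspaceSize(1ULL << 28);  // scratch memory for tactic selection; 256 MB is only an example value
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);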
I’ve assigned input / output buffers for batch_size images:
for (size_t i = 0; i < engine->getNbBindings(); ++i)
{
    // Size of one binding for the whole batch: elements per sample * batch_size * sizeof(float)
    auto binding_size = getSizeByDim(engine->getBindingDimensions(i)) * batch_size * sizeof(float);
    cudaMalloc(&buffers[i], binding_size);
    if (engine->bindingIsInput(i))
    {
        input_dims.emplace_back(engine->getBindingDimensions(i));
    }
    else
    {
        output_dims.emplace_back(engine->getBindingDimensions(i));
    }
}
I’ve activated the enqueue API with batch_size of 5:
context->enqueue(batch_size, buffers.data(), localStream, nullptr);
I’m reading back the full batched output:
std::vector<float> cpu_output(getSizeByDim(dims) * batch_size);
cudaMemcpy(cpu_output.data(), gpu_output, cpu_output.size() * sizeof(float), cudaMemcpyDeviceToHost);
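For completeness, the matching input copy looks like this (a sketch, assuming cpu_input is a contiguous host buffer holding batch_size preprocessed images and buffers[0] is the input binding):
cudaMemcpy(buffers[0], cpu_input.data(), cpu_input.size() * sizeof(float), cudaMemcpyHostToDevice);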

I’ve read a few posts on the topic of running inference on several images at a time, but I couldn’t locate the issue in my code yet; any assistance would be appreciated.

imagenet_classes.txt (21.2 KB) SampleFlow.cpp (17.0 KB)

Environment

TensorRT Version: 7.2.1.6 (Windows10.x86_64, cuda-10.2, cudnn8.0)
GPU Type: Quadro M2000M
Nvidia Driver Version: 26.21.14.4122
CUDA Version: 10.2
CUDNN Version: 8.0.5.39 (cudnn-10.2-windows10-x64-v8.0.5.39)
Operating System + Version: Windows 10


Hi, we request you to share your model and script so that we can help you better.

Alternatively, you can try running your model with the trtexec command:
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
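For example, a rough invocation could look like this (the file name and input tensor name here are assumptions):
trtexec --onnx=resnet50.onnx --shapes=input:5x3x224x224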

Thanks!

Thanks for the fast reply. Attached is a minimal running example.

The model itself is too big to upload, but it’s a plain ResNet-50 generated using PyTorch. I’ve uploaded the generating script: Script.7z (338.2 KB).

Code + ONNX model now shared on

Hi @amit.katzi,

Could you please check the batch dimension of the ONNX input and make sure it is -1 when exporting to ONNX, i.e. set the batch dimension as a dynamic axis during export.
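On the C++ side, one way to confirm this after parsing (a sketch, assuming network is the INetworkDefinition produced by the ONNX parser and the batch axis is the first input dimension):
nvinfer1::Dims inputDims = network->getInput(0)->getDimensions();
// inputDims.d[0] should be -1 here; if it is still 1, the batch axis was not exported as dynamic.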

For your reference,

Thank you.

Thanks for the advice @spolisetty.

The ONNX input indeed had fixed dimensions 1x3x224x224. I recreated the ONNX with dynamic input & output, and the input now looks like -1x3x224x224.

After more fixes (like adding an optimization profile) I was able to run inference over 5 images using a single enqueue call.

I measured an improvement of 35% when switching from batch_size 1 to batch_size 5.
I measured a similar gain on the Quadro M2000M (FP32) and on the Xavier AGX (FP16).

I use the following optimization settings:
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(5, 3, 224, 224));
profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(5, 3, 224, 224));
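For completeness, this is roughly how the rest of the explicit-batch path fits together (a minimal sketch; the config and context objects, the input binding index 0, and the use of enqueueV2 reflect my setup and may differ in yours):
// Build time: register the profile with the builder config before creating the engine.
config->addOptimizationProfile(profile);
// Run time: set the actual batch size for this inference, then use the explicit-batch enqueue.
context->setBindingDimensions(0, nvinfer1::Dims4(batch_size, 3, 224, 224));
context->enqueueV2(buffers.data(), localStream, nullptr);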

Checking the profiler on the Quadro M2000M shows a kernel efficiency of 25%, and it does not increase when going from batch_size 1 to batch_size 5.

Can you offer some advice on how to get better throughput at batch_size 5?

Hi @amit.katzi,

Could you please let us know what “kernel efficiency” refers to and which tool you used to calculate this metric?

Thank you.

Hi @spolisetty,

I’m using the Nsight Systems 2019.5.2 tool for profiling.
The metric I’m referring to is the “Theoretical Occupancy” that Nsight displays for the different kernels used when the network runs. All DNN kernels show the same 25% theoretical occupancy (running on the Quadro M2000M).
When running on the Xavier AGX, run time is halved compared to the Quadro M2000M due to the use of FP16, so I estimate the occupancy is not higher there either.

Hi @amit.katzi,

This is a known nsys issue: the CUDA Occupancy Calculator shows 25%.
I am not sure whether your “Theoretical Occupancy” reading is affected by it as well.
Please check GPU utilization using nvidia-smi.
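For example, something along these lines while the inference loop is running (the query fields are just a suggestion):
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1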

Thank you.