How to perform batch inference with explicit batch?


I can’t seem to find a clear example of how to perform batch inference using explicit batch mode.
I see many outdated articles pointing to this example here, but the code only uses a batch size of 1. Other examples use implicit batch mode, which is now deprecated, so I need an example that demonstrates explicit batch mode.

How can I use a batch size larger than 1?

I followed the sampleOnnxMNIST.cpp sample code to create the following:

You can assume that m_inputDims and m_outputDims are of type nvinfer1::Dims and already contain relevant information.

bool InferenceEngine::infer() {
    // Read the serialized engine file into memory
    std::ifstream file(m_enginePath, std::ios::binary | std::ios::ate);
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<char> buffer(size);
    if (!file.read(buffer.data(), size)) {
        throw std::runtime_error("Unable to read engine file");
    }

    std::unique_ptr<IRuntime> runtime{createInferRuntime(m_logger)};
    if (!runtime) {
        return false;
    }

    // Deserialize the engine from the in-memory buffer
    m_engine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(buffer.data(), buffer.size()));
    if (!m_engine) {
        return false;
    }

    // Create RAII buffer manager object
    samplesCommon::BufferManager buffers(m_engine); // TODO can specify the batch size in this call.

    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(m_engine->createExecutionContext());
    if (!context) {
        return false;
    }

    size_t batchSize = 1;

    if (!processInput(buffers, batchSize)) {
        return false;
    }

    // Memcpy from host input buffers to device input buffers
    buffers.copyInputToDevice();

    bool status = context->executeV2(buffers.getDeviceBindings().data());
    if (!status) {
        return false;
    }

    // Memcpy from device output buffers to host output buffers
    buffers.copyOutputToHost();

    const int outputSize = m_outputDims.d[1];
    float* output = static_cast<float*>(buffers.getHostBuffer("2621"));

    // Print the output vector of the single image in the batch
    for (int i = 0; i < outputSize; ++i) {
        std::cout << output[i] << " ";
    }

    std::cout << "\n\n\n" << std::endl;
    return true;
}

And here is the definition for the processInput function:

bool InferenceEngine::processInput(const samplesCommon::BufferManager &buffers, size_t batchSize) {
    auto image = cv::imread("../img.jpg");
    if (image.empty()) {
        throw std::runtime_error("Could not load image");
    }

    cv::cvtColor(image, image, cv::COLOR_BGR2RGB);

    const int inputH = m_inputDims.d[2];
    const int inputW = m_inputDims.d[3];

    // Preprocess code: scale to [0, 1], then normalize to [-1, 1]
    image.convertTo(image, CV_32FC3, 1.f / 255.f);
    cv::subtract(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, cv::noArray(), -1);
    cv::divide(image, cv::Scalar(0.5f, 0.5f, 0.5f), image, 1, -1);

    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("input.1"));

    // Convert interleaved HWC pixels to planar CHW (input size hard-coded to 112x112)
    int r = 0, g = 0, b = 0;
    for (int i = 0; i < 112 * 112 * 3; ++i) {
        if (i % 3 == 0) {
            hostDataBuffer[r++] = *(reinterpret_cast<float*>(image.data) + i);
        } else if (i % 3 == 1) {
            hostDataBuffer[g++ + 112*112] = *(reinterpret_cast<float*>(image.data) + i);
        } else {
            hostDataBuffer[b++ + 112*112*2] = *(reinterpret_cast<float*>(image.data) + i);
        }
    }

    // Print the first few input values for debugging
    for (int i = 0; i < 30; ++i) {
        std::cout << hostDataBuffer[i] << " ";
    }
    std::cout << "\n\n";

    return true;
}
For a batch size of 1, this works great. However, how would I adapt the above code to work for a batch size greater than 1? The call float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer("input.1")); returns a buffer that only holds enough memory for a batch of one, so how do I ensure enough memory has been allocated for the batch size I plan to run?
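To make the memory question concrete, here is a minimal standalone sketch of the layout a batched input would need, assuming the network expects a contiguous NCHW float tensor. packBatchNCHW is a hypothetical helper, not part of TensorRT or samplesCommon, and it uses plain std::vector instead of the BufferManager host buffer. The point is only that the buffer must hold batch * channels * height * width elements, with image n starting at offset n * channels * height * width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TensorRT API): packs several interleaved HWC
// float images into one contiguous NCHW buffer.
std::vector<float> packBatchNCHW(const std::vector<std::vector<float>>& images,
                                 std::size_t channels,
                                 std::size_t height,
                                 std::size_t width) {
    const std::size_t plane = height * width;       // elements per channel plane
    const std::size_t imageSize = channels * plane; // elements per image

    // The batched buffer needs room for every image, not just one.
    std::vector<float> batch(images.size() * imageSize);

    for (std::size_t n = 0; n < images.size(); ++n) {
        assert(images[n].size() == imageSize);
        for (std::size_t p = 0; p < plane; ++p) {
            for (std::size_t c = 0; c < channels; ++c) {
                // HWC source index: p * channels + c
                // NCHW destination index: n * imageSize + c * plane + p
                batch[n * imageSize + c * plane + p] = images[n][p * channels + c];
            }
        }
    }
    return batch;
}
```

With the code in the question, the same indexing would be applied to hostDataBuffer, provided the buffer was allocated for the full batch.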

Additionally, I am confused about how the call bool status = context->executeV2(buffers.getDeviceBindings().data()); knows the batch size, since we pass no argument stating how large the buffers are.
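For context, in explicit batch mode the batch size is simply the leading dimension of each binding's shape, which is why executeV2() takes no batch argument. When that dimension is dynamic (-1), the concrete shape is set on the execution context before running, for example with context->setBindingDimensions(). The standalone sketch below only models that idea: Shape, resolveBatch, volume and bufferBytes are hypothetical names, not TensorRT types, and the 3x112x112 input is taken from the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a tensor shape; -1 marks a dynamic dimension.
using Shape = std::vector<std::int64_t>;

// Replace the dynamic leading (batch) dimension with a concrete batch size.
Shape resolveBatch(Shape shape, std::int64_t batchSize) {
    assert(!shape.empty() && shape[0] == -1);
    assert(batchSize > 0);
    shape[0] = batchSize;
    return shape;
}

// Number of elements in a fully specified shape.
std::size_t volume(const Shape& shape) {
    std::size_t count = 1;
    for (std::int64_t d : shape) {
        assert(d >= 0); // every dimension must be resolved first
        count *= static_cast<std::size_t>(d);
    }
    return count;
}

// Bytes that a host or device buffer needs for this shape.
std::size_t bufferBytes(const Shape& shape, std::size_t elementSize) {
    return volume(shape) * elementSize;
}
```

For a batch of 8 float images of shape 3x112x112, bufferBytes(resolveBatch({-1, 3, 112, 112}, 8), sizeof(float)) gives the size that each input buffer would have to be.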


TensorRT Version:
GPU Type: RTX 3080
Nvidia Driver Version: 465.19.01
CUDA Version: 11.3
CUDNN Version:
Operating System + Version: Ubuntu 20.04


Please refer to the link below for working with dynamic shapes:

You can fine-tune the model for a specific range of input dimensions using optimization profiles.

The following example may help you.

Thank you.

Thank you for the resources, they are helpful.
I’m following along with the “Digit Recognition With Dynamic Shapes In TensorRT” example and I have a bit of confusion.

Can you please explain the difference between optimization profiles vs calibration profiles?
The documentation doesn’t seem to explain what the difference is.


Optimization profile:

profile->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{1, 1, 1, 1});
profile->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{1, 1, 28, 28});
profile->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{1, 1, 56, 56});

Calibration profile:

auto profileCalib = builder->createOptimizationProfile();
const int calibBatchSize{256};
profileCalib->setDimensions(input->getName(), OptProfileSelector::kMIN, Dims4{calibBatchSize, 1, 28, 28});
profileCalib->setDimensions(input->getName(), OptProfileSelector::kOPT, Dims4{calibBatchSize, 1, 28, 28});
profileCalib->setDimensions(input->getName(), OptProfileSelector::kMAX, Dims4{calibBatchSize, 1, 28, 28});

Additionally, can you say more about selecting optimization profiles at runtime (beyond just linking me to the docs page)?
The example you linked doesn’t select an optimization profile.

What if we are dealing with changing batch sizes?
For example, we use a batch size of 16, then 8, then 1.
Do we need to change the optimization profile every time?
Will changing the profile impact performance?

How do we get the most performance in a situation like this?


  • Do we need to change the optimization profile every time?
    No, you can use the same optimization profile so long as the actual batch size is covered by the [min, max] batch sizes of that optimization profile when the engine is built.

  • Will changing the profile impact performance?
    Yes, there will be some perf overhead when changing profiles.

  • How do we get the most performance in a situation like this?
    Use the same optimization profile, but call context->setBindingDimensions() every time the batch size changes.

  • What is the difference between optimization profiles vs calibration profiles?
    Calibration profiles are the profiles used only in the calibration process.
    Please refer to the Developer Guide :: NVIDIA Deep Learning TensorRT Documentation.
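
To illustrate the first answer, here is a small standalone sketch of the coverage rule. BatchRange, covers and selectProfile are hypothetical names, not TensorRT API. In a real engine the ranges come from the kMIN and kMAX dimensions given to setDimensions() when the engine is built. The sketch only shows that one profile with a batch range of [1, 16] covers the 16, 8, 1 sequence from the question, so no profile switch is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the batch range of one optimization profile.
struct BatchRange {
    int minBatch;
    int optBatch;
    int maxBatch;
};

// A profile can serve a batch size if it lies within [minBatch, maxBatch].
bool covers(const BatchRange& profile, int batch) {
    return batch >= profile.minBatch && batch <= profile.maxBatch;
}

// Returns the index of the first profile covering `batch`, or -1 if none does.
int selectProfile(const std::vector<BatchRange>& profiles, int batch) {
    for (std::size_t i = 0; i < profiles.size(); ++i) {
        if (covers(profiles[i], batch)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A batch size outside every range (for example 32 with a maximum of 16) returns -1 here, and in practice would mean rebuilding the engine with a wider profile.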

Thank you.