Failure to do inference


The attached C++ program is for LPRNet, and it fails when running inference.
How can I get a correct inference result?


TensorRT Version:
GPU Type: jetson xavier nx
Nvidia Driver Version: jetpack 4.5.1
CUDA Version: 10.2.89
CUDNN Version:
Operating System + Version: ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

nx4.5us_lprnet_baseline18_deployable.etlt_b16_gpu0_fp16.engine (31.6 MB)
trt_lprnet.cpp (5.6 KB)

The attached engine file is produced by the following commands:

wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/lprnet/versions/deployable_v1.0/files/us_lprnet_baseline18_deployable.etlt
./tao-converter -k nvidia_tlt -p image_input,1x3x48x96,4x3x48x96,16x3x48x96 us_lprnet_baseline18_deployable.etlt -t fp16 -e lpr_us_onnx_b16.engine

Steps To Reproduce


The file can be compiled/run by using:

g++ trt_lprnet.cpp -lnvinfer -Lcudart -pthread $(pkg-config --cflags --libs opencv4) -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart -O0 -p -g 

Error messages:

a.out: trt_lprnet.cpp:62: void doInference(nvinfer1::IExecutionContext&, float*, float*, int): Assertion `engine.getNbBindings() == 2' failed.
Aborted (core dumped)

Can you try running your model with the trtexec command and share the "--verbose" log if the issue persists?

You can refer to the link below for the list of supported operators. If an operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if not shared already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:


Unfortunately, I have not been able to compile the mentioned trtexec so far.
The attached model should be good because it works with the Python code.
lprnet_simplest.py (3.4 KB)

The identical model works with the attached Python code but does not work with the attached C++ code. So I am just looking for working C++ code.


Looks like you’re using TAO toolkit LPRNET. We are moving this post to TAO forum to get better help.

Thank you.

I suggest taking a look at NVIDIA-AI-IOT/deepstream_lpr_app on GitHub (sample app code for LPR deployment on DeepStream) for running inference.

Also, please note that LPRNet has two outputs. See below:

$ python -m pip install colored
$ python -m pip install polygraphy --index-url https://pypi.ngc.nvidia.com
$ polygraphy inspect model model.plan
[I] Loading bytes from /workspace/model.plan
[I] ==== TensorRT Engine ====
Name: Unnamed Network 0 | Explicit Batch Engine (35 layers)

---- 1 Engine Input(s) ----
{image_input [dtype=float32, shape=(-1, 3, 48, 96)]}

---- 2 Engine Output(s) ----
{tf_op_layer_ArgMax [dtype=int32, shape=(-1, 24)],
tf_op_layer_Max [dtype=float32, shape=(-1, 24)]}

---- Memory ----
Device Memory: 65249280 bytes

---- 1 Profile(s) (3 Binding(s) Each) ----

  • Profile: 0
    Binding Index: 0 (Input) [Name: image_input] | Shapes: min=(1, 3, 48, 96), opt=(4, 3, 48, 96), max=(16, 3, 48, 96)
    Binding Index: 1 (Output) [Name: tf_op_layer_ArgMax] | Shape: (-1, 24)
    Binding Index: 2 (Output) [Name: tf_op_layer_Max] | Shape: (-1, 24)
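
To make the three bindings above concrete, here is a small standalone sketch of how the buffer size for each one could be computed once the batch size is known. It does not call TensorRT; the names DType, elementSize and bindingBytes are made up for illustration, and the shapes are simply copied from the polygraphy output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element types of the LPRNet bindings as reported by polygraphy
// (hypothetical enum, not the TensorRT DataType).
enum class DType { kFloat32, kInt32 };

// Size in bytes of one element of the given type.
inline std::size_t elementSize(DType t) {
    return t == DType::kFloat32 ? sizeof(float) : sizeof(std::int32_t);
}

// Bytes needed for a binding whose leading dimension is dynamic (-1)
// and gets replaced by the actual batch size.
inline std::size_t bindingBytes(const std::vector<int>& shape, DType t, int batchSize) {
    assert(!shape.empty() && shape[0] == -1 && batchSize > 0);
    std::size_t count = static_cast<std::size_t>(batchSize);
    for (std::size_t i = 1; i < shape.size(); ++i) {
        assert(shape[i] > 0);  // only the batch dimension may be dynamic here
        count *= static_cast<std::size_t>(shape[i]);
    }
    return count * elementSize(t);
}
```

For a batch of 1 this gives 55296 bytes for image_input (3 x 48 x 96 floats) and 96 bytes for each of the two outputs (24 values of 4 bytes), one of which is int32 and the other float32.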

I modified a few things based on your code.
It works now. Please check.

#include <string>
#include <future>
#include <deque>
#include <fstream>
#include <iostream>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <mutex>
#include <stdio.h>
#include <cassert>
#include <chrono>
#include <vector>
#include <opencv2/opencv.hpp>
#include "cuda_runtime_api.h"
#include <cuda.h>
#include "NvInfer.h"
#define DEVICE 0  // GPU id
#define BATCH_SIZE 1
static const int INPUT_H = 48;
static const int INPUT_W = 96;
static const int OUTPUT_SIZE = 24;
const char *INPUT_BLOB_NAME = "image_input";
const char *OUTPUT_BLOB_NAME_1 = "tf_op_layer_ArgMax";
const char *OUTPUT_BLOB_NAME_2 = "tf_op_layer_Max";
// US plate alphabet: 35 characters (no letter "O"); index 35 is the blank class.
const std::string alphabet[] = {
    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
    "K", "L", "M", "N", "P", "Q", "R", "S", "T", "U",
    "V", "W", "X", "Y", "Z"
};
// Report a non-zero CUDA runtime status.
#define CHECK(status) \
    do {\
        auto ret = (status);\
        if (ret != 0)\
            std::cerr << "Cuda failure: " << ret << std::endl;\
    } while (0)
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        // Print warnings and anything more severe.
        if (severity <= Severity::kWARNING){
            std::cout << msg << std::endl;
        }
    }
} logger;
void doInference(nvinfer1::IExecutionContext &context, float *input, int *output_1, float *output_2, int batchSize) {
    const nvinfer1::ICudaEngine &engine = context.getEngine();
    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers
    // (one input and two outputs for LPRNet).
    assert(engine.getNbBindings() == 3);
    void *buffers[3];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex_1 = engine.getBindingIndex(OUTPUT_BLOB_NAME_1);
    const int outputIndex_2 = engine.getBindingIndex(OUTPUT_BLOB_NAME_2);
    // Create GPU buffers on device (ArgMax output is int32, Max output is float32)
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex_1], batchSize * OUTPUT_SIZE * sizeof(int)));
    CHECK(cudaMalloc(&buffers[outputIndex_2], batchSize * OUTPUT_SIZE * sizeof(float)));
    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    // Note: the engine has a dynamic batch dimension. If it was built with a profile that
    // is not fixed to a single shape, the input dimensions likely have to be set on the
    // context first (IExecutionContext::setBindingDimensions) before running.
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output_1, buffers[outputIndex_1], batchSize * OUTPUT_SIZE * sizeof(int), cudaMemcpyDeviceToHost, stream));
    CHECK(cudaMemcpyAsync(output_2, buffers[outputIndex_2], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    // Wait until the copies and the inference have finished
    CHECK(cudaStreamSynchronize(stream));
    // Release stream and buffers
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex_1]));
    CHECK(cudaFree(buffers[outputIndex_2]));
}
int main(int argc, char *argv[]){
    // Read the serialized engine from disk
    char *trtModelStream{nullptr};
    size_t size{0};
    std::ifstream file("/workspace/demo_2.0/lprnet/lpr_us_onnx_b16.engine", std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        file.read(trtModelStream, size);
        file.close();
    }
    std::cout << "size:" << size <<"\n";
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    assert(runtime != nullptr);
    nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);
    assert(engine != nullptr);
    nvinfer1::IExecutionContext *context = engine->createExecutionContext();
    assert(context != nullptr);
    delete[] trtModelStream;
    // Load the image, resize it to 96x48 and convert HWC (BGR) bytes to normalized CHW floats
    static float data[BATCH_SIZE * 3 * INPUT_H * INPUT_W];
    cv::Mat img = cv::imread("/workspace/demo_2.0/lprnet/data/openalpr/train/image/wts-lg-000158.jpg");
    cv::Mat pr_img;
    cv::resize(img, pr_img, cv::Size(INPUT_W, INPUT_H), 0, 0, cv::INTER_CUBIC);
    int i = 0;
    for (int row = 0; row < INPUT_H; ++row) {
        uchar* uc_pixel = pr_img.data + row * pr_img.step;
        for (int col = 0; col < INPUT_W; ++col) {
            data[i + 2 * INPUT_H * INPUT_W] = ((float)uc_pixel[2] - 127.5)*0.003921568627451;
            data[i + INPUT_H * INPUT_W] = ((float)uc_pixel[1]-127.5)*0.003921568627451;
            data[i] = ((float)uc_pixel[0]-127.5)*0.003921568627451;
            uc_pixel += 3;
            ++i;
        }
    }
    // Run inference
    static int tf_op_layer_ArgMax[BATCH_SIZE * OUTPUT_SIZE];
    static float tf_op_layer_Max[BATCH_SIZE * OUTPUT_SIZE];
    auto start = std::chrono::system_clock::now();
    printf("running inference \n");
    doInference(*context, data, tf_op_layer_ArgMax, tf_op_layer_Max, BATCH_SIZE);
    auto end = std::chrono::system_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << "us" << std::endl;
    std::cout << std::endl;

    // Collect the predicted class index for each of the 24 time steps
    std::vector<int> preds;
    for (int i = 0; i < 24; ++i) {
        preds.push_back(tf_op_layer_ArgMax[i]);
    }
    // remove repeat blank label (35 is the blank class);
    // start from blank so that the first character is not dropped
    int pre_c = 35;
    std::vector<int> no_repeat_blank_label;
    for (auto c: preds) {
        if (c == pre_c || c == 35) {
            if (c == 35) pre_c = c;
            continue;
        }
        no_repeat_blank_label.push_back(c);
        pre_c = c;
    }
    //print the character list
    std::string str;
    for (auto v: no_repeat_blank_label) {
        str += alphabet[v];
    }
    std::cout << "result:" << str << std::endl;
    // Destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}
time g++ trt_lprnet.cpp -lnvinfer -Lcudart -pthread $(pkg-config --cflags --libs opencv4) -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart -O0 -p -g && time ./a.out
nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcutil -lcudpp -lcuda -lcudart -c -o FSPB_main.o FSPB_main.cpp
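
For reference, the post-processing at the end of the program above is a greedy CTC-style decode: collapse repeated indices, then drop the blank index. Below is a minimal standalone sketch of just that step, so it can be checked without TensorRT. It assumes the blank class is the index right after the last character (35 for the US alphabet, as in the code above); the names usAlphabet and decodePlate are only illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the 35-character US plate alphabet: digits, then A-Z without 'O'.
inline std::vector<std::string> usAlphabet() {
    std::vector<std::string> a;
    for (char c = '0'; c <= '9'; ++c) a.push_back(std::string(1, c));
    for (char c = 'A'; c <= 'Z'; ++c) {
        if (c != 'O') a.push_back(std::string(1, c));
    }
    return a;
}

// Greedy decode of one row of argmax indices.
// Assumption: the blank class is index alphabet.size().
inline std::string decodePlate(const std::vector<int>& argmax,
                               const std::vector<std::string>& alphabet) {
    const int blank = static_cast<int>(alphabet.size());
    std::string plate;
    int prev = blank;  // start from blank so the first character is kept
    for (int c : argmax) {
        assert(c >= 0 && c <= blank);
        // Emit a character only if it is not blank and not a repeat of the previous step.
        if (c != blank && c != prev) plate += alphabet[c];
        prev = c;
    }
    return plate;
}
```

For example, decodePlate({35, 4, 4, 35, 10, 11, 11, 35, 0, 0}, usAlphabet()) returns "4AB0": the repeated 4, 11 and 0 collapse and the blanks are dropped.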

When I run your code, I cannot get the license plate number, and the tf_op_layer_ArgMax output contains all zeros. Can you share your working model file and the test image? Thanks a lot.

By the way, I got the following error message:

Parameter check failed at: engine.cpp::resolveSlots::1227, condition: allInputDimensionsSpecified(routine)


$ tao lprnet run /bin/bash
# wget https://api.ngc.nvidia.com/v2/models/nvidia/tao/lprnet/versions/deployable_v1.0/files/us_lprnet_baseline18_deployable.etlt
# tao-converter -k nvidia_tlt -p image_input,1x3x48x96,1x3x48x96,1x3x48x96 us_lprnet_baseline18_deployable.etlt -t fp16 -e lpr_us_onnx_b16.engine
# time g++ trt_lprnet.cpp -lnvinfer -Lcudart -pthread $(pkg-config --cflags --libs opencv) -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -lcudart -O0 -p -g && time ./a.out
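
The difference between this build and the earlier one is the -p argument: here min, opt and max are all 1x3x48x96, whereas the first engine used 1x3x48x96, 4x3x48x96 and 16x3x48x96. As a rough sketch, the following standalone C++ parses such a profile string and tells whether the shape is fixed and whether a given input shape falls inside the range. It is not tao-converter or TensorRT code; Profile, parseProfile, isStatic and accepts are hypothetical helpers, and whether the fixed profile is what avoids the allInputDimensionsSpecified error is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Shape = std::vector<int>;

// One optimization profile as given to -p: <name>,<min>,<opt>,<max>.
struct Profile {
    std::string input;
    Shape min, opt, max;
};

// Parse "1x3x48x96" into {1, 3, 48, 96}.
inline Shape parseShape(const std::string& text) {
    Shape dims;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, 'x')) dims.push_back(std::stoi(item));
    return dims;
}

// Parse "image_input,1x3x48x96,4x3x48x96,16x3x48x96".
inline Profile parseProfile(const std::string& text) {
    std::vector<std::string> parts;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) parts.push_back(item);
    assert(parts.size() == 4);
    return Profile{parts[0], parseShape(parts[1]), parseShape(parts[2]), parseShape(parts[3])};
}

// True when min and max coincide, i.e. only one input shape is possible.
inline bool isStatic(const Profile& p) { return p.min == p.max; }

// True when every dimension of s lies within [min, max].
inline bool accepts(const Profile& p, const Shape& s) {
    if (s.size() != p.min.size() || s.size() != p.max.size()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] < p.min[i] || s[i] > p.max[i]) return false;
    }
    return true;
}
```

With the first profile, a batch of 8 is accepted and a batch of 32 is not, and the shape is not fixed, so the program has to say which batch size it wants. With the second profile the only possible input shape is 1x3x48x96.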
