Not getting correct output when running TensorRT inference on an LPRNet FP16 model

Hi,

When I run inference on LPRNet, I get the output below instead of the plate number.

input: shape:(-1, 3, 48, 96) dtype:DataType.FLOAT
output: shape:(-1, 24) dtype:DataType.INT32
output: shape:(-1, 24) dtype:DataType.FLOAT
[array([35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35,
       35, 35, 35, 35, 35, 35, 35], dtype=int32), array([0.9999987 , 0.99999976, 1.        , 1.        , 1.        ,
       0.9999999 , 0.9999999 , 0.9999999 , 0.9999999 , 0.99999976,
       0.99999976, 0.99999976, 0.9999999 , 0.9999999 , 0.99999976,
       0.99999976, 0.99999976, 0.99999976, 0.99999976, 0.99999976,
       0.99999964, 0.9999993 , 0.9999987 , 0.99999964], dtype=float32)]

My code is below:

import os
import time

import cv2
#import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image
import pdb


class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine, batch_size=-1):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        # pdb.set_trace()
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"input: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"output: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
    return inputs, outputs, bindings, stream



def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(
        batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
    )
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

# TensorRT logger singleton
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_engine_path = "number_plate_classification_b8_fp16.engine"

trt_runtime = trt.Runtime(TRT_LOGGER)
# pdb.set_trace()
trt_engine = load_engine(trt_runtime, trt_engine_path)
# Execution context is needed for inference
context = trt_engine.create_execution_context()
# This allocates memory for network inputs/outputs on both CPU and GPU
inputs, outputs, bindings, stream = allocate_buffers(trt_engine)

# pdb.set_trace()
image = cv2.imread("1626673361593_cropped_batch_code_image_imgGB3_BATOO69_.jpg")
image = cv2.resize(image, (96, 48))/255.0

image = image.T

np.copyto(inputs[0].host, image.ravel())

input_shape = (1,3,48,96)
context.set_binding_shape(0, input_shape)

output = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
print(output)

Please help me understand what I should change to get the correct inference result.

Thanks.

Can you run the default inference method and get the correct result?
See
https://docs.nvidia.com/tlt/tlt-user-guide/text/character_recognition/lprnet.html#running-inference-on-the-lprnet-model

  tlt lprnet inference -m <model>
                   -i <in_image_path>
                   -e <experiment_spec>
                   [-k <key>]
                   [--gpu_index <gpu_index>]
                   [--log_file <log_file>]
                   [--trt]

Thanks @Morganh for the response.

Actually, I have successfully run this model in a DeepStream application and get correct results there. Now I want to use the NPR model from custom code (Python, OpenCV and PyCUDA), but with the custom code I get the issue described above.

How can I get the correct result out of do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)?

input: shape:(-1, 3, 48, 96) dtype:DataType.FLOAT
output: shape:(-1, 24) dtype:DataType.INT32
output: shape:(-1, 24) dtype:DataType.FLOAT
[array([35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35,
       35, 35, 35, 35, 35, 35, 35], dtype=int32), array([0.9999987 , 0.99999976, 1.        , 1.        , 1.        ,
       0.9999999 , 0.9999999 , 0.9999999 , 0.9999999 , 0.99999976,
       0.99999976, 0.99999976, 0.9999999 , 0.9999999 , 0.99999976,
       0.99999976, 0.99999976, 0.99999976, 0.99999976, 0.99999976,
       0.99999964, 0.9999993 , 0.9999987 , 0.99999964], dtype=float32)]

Can you modify

image = cv2.resize(image, (96, 48))/255.0
image = image.T

to

image = np.array([(cv2.resize(img, ( 96 , 48 )))/ 255.0 for img in image], dtype=np.float32)

image= image.transpose( 0 , 3 , 1 , 2 )
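
For context on why the transpose matters: OpenCV returns images in height x width x channel (HWC) order, while the engine input printed above is (-1, 3, 48, 96), i.e. NCHW. A plain .T reverses all axes, which gives channel x width x height rather than channel x height x width. Below is a minimal NumPy-only sketch of the difference. The zero array stands in for a resized image, and to_nchw is a hypothetical helper name, not part of any library.

```python
import numpy as np

def to_nchw(images):
    """Stack HWC images (already resized to 48x96) into one NCHW float32 batch."""
    batch = np.asarray(images, dtype=np.float32) / 255.0       # (N, H, W, C)
    return np.ascontiguousarray(batch.transpose(0, 3, 1, 2))   # (N, C, H, W)

# Stand-in for cv2.resize(img, (96, 48)): 48 rows, 96 columns, 3 channels.
hwc = np.zeros((48, 96, 3), dtype=np.uint8)

print(hwc.T.shape)            # (3, 96, 48): width and height are swapped
print(to_nchw([hwc]).shape)   # (1, 3, 48, 96): matches the engine input
```

The np.ascontiguousarray call matters because transpose only changes strides; ravel() or np.copyto into the host buffer should see the data in NCHW memory order.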

Yes, I made this change, but I get an error:

image= image.transpose( 0 , 3 , 1 , 2 )
ValueError: axes don't match array

I still recommend running tlt lprnet inference --trt xxx against the engine number_plate_classification_b8_fp16.engine to check whether it works.

Okay, but the same engine file works with the DeepStream application.

It seems the engine was built for batch size 8.

Yes, it is for batch size 8. So you mean I should generate an engine for batch size 1 and then retry?

tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,4x3x48x96,16x3x48x96 ./us_lprnet_baseline18_deployable.etltunpruned.etlt -t fp16 -e /opt/nvidia/deepstream/deepstream-5.0/samples/models/LP/LPR/lpr_us_onnx_b16.engine

Please follow the NVIDIA-AI-IOT/deepstream_lpr_app repository on GitHub (sample app code for LPR deployment on DeepStream):

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,4x3x48x96,16x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

With batch size 1:

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,1x3x48x96,1x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

I get this error:

Traceback (most recent call last):
  File "inference_trt_npr.py", line 84, in <module>
    inputs, outputs, bindings, stream = allocate_buffers(trt_engine)
  File "inference_trt_npr.py", line 44, in allocate_buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
pycuda._driver.MemoryError: cuMemHostAlloc failed: out of memory
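
A possible cause of this error (an assumption, not confirmed in the thread): allocate_buffers defaults to batch_size=-1. With the dynamic engine the binding shape is (-1, 3, 48, 96), so the two negative signs cancel and the size comes out as 13824. If the batch-1 engine reports a fixed shape (1, 3, 48, 96), the same formula gives -13824, which is not a valid allocation size. The sketch below shows a size computation that stays positive in both cases. binding_size is a hypothetical helper, and math.prod stands in for trt.volume so the example runs without TensorRT.

```python
import math

def binding_size(shape, max_batch=1):
    """Element count for a binding, substituting max_batch for dynamic (-1) dims."""
    dims = [max_batch if d == -1 else d for d in shape]
    return math.prod(dims)

# What the original formula computes (product of dims times batch_size=-1):
print(math.prod((-1, 3, 48, 96)) * -1)   # 13824, positive only by accident
print(math.prod((1, 3, 48, 96)) * -1)    # -13824, invalid as a buffer size

# Sizes that stay positive for either kind of engine:
print(binding_size((-1, 3, 48, 96)))          # 13824
print(binding_size((1, 3, 48, 96)))           # 13824
print(binding_size((-1, 24), max_batch=16))   # 384
```

Sizing the buffers for the largest batch in the optimization profile lets one allocation serve any smaller batch.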

With batch size 16:

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,4x3x48x96,16x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

I get this error:

Traceback (most recent call last):
  File "inference_trt_npr.py", line 92, in <module>
    image= image.transpose( 0 , 3 , 1 , 2 )
ValueError: axes don't match array

I have read that NPR needs an extra post-processing plugin: the model outputs a character id sequence, and a DeepStream post-process plugin is needed to get the final license plate.

So how can we use this in a custom code base? And is the issue caused by the missing plugin or by something else?

Thanks.
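
On the post-processing question: the INT32 output is a per-time-step sequence of character ids, and models of this kind are commonly decoded with greedy CTC decoding (collapse consecutive repeats, then drop the blank id). The sketch below assumes the blank id equals the number of characters in the dictionary, which would be 35 for a 35-character US dictionary. If that assumption holds, an output of all 35s decodes to an empty string, which points at the input preprocessing rather than at a missing plugin. The character list here is an assumption and should be replaced by the dictionary file used in training; ctc_greedy_decode is a hypothetical helper name.

```python
def ctc_greedy_decode(ids, characters, blank_id=None):
    """Collapse repeated ids, then remove blanks (greedy CTC decoding)."""
    if blank_id is None:
        blank_id = len(characters)  # assumption: blank is the last class
    decoded = []
    previous = blank_id
    for idx in ids:
        # Emit a character only when it differs from the previous step
        # and is not the blank symbol.
        if idx != previous and idx != blank_id:
            decoded.append(characters[idx])
        previous = idx
    return "".join(decoded)

# Assumed 35-character dictionary: digits plus A-Z without the letter O.
CHARACTERS = list("0123456789ABCDEFGHIJKLMNPQRSTUVWXYZ")

print(repr(ctc_greedy_decode([35] * 24, CHARACTERS)))                     # ''
print(ctc_greedy_decode([10, 10, 35, 11, 35, 1, 1, 35, 1], CHARACTERS))   # AB11
```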

Where did you run the command below?

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,1x3x48x96,1x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

And where did you run your standalone python script for inference?

I ran it from the directory that contains the tlt-converter binary, and in the Python script I gave the full path of the engine generated by the command.

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,1x3x48x96,1x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

Sorry, I mean: on which device did you run the following?

./tlt-converter -k nvidia_tlt -p image_input,1x3x48x96,1x3x48x96,1x3x48x96 \
           models/LP/LPR/us_lprnet_baseline18_deployable.etlt -t fp16 -e models/LP/LPR/lpr_us_onnx_b16.engine

I am using a Jetson Xavier NX and ran it on the Xavier NX itself.

Thanks. And you run the inference Python script on the Xavier too, right?

Yes. I am doing this whole process on the Xavier NX.

How did you download tlt-converter? Did you download the correct version?

Yes, I downloaded the correct one.

I get correct inference results in the DeepStream application with the same engine file (batch sizes 8 and 16), but the same engine file gives the result shown above in my custom Python code.

Is the problem with the engine file?

Or what should I correct in my custom code to get the right inference result?

Please modify

image = cv2.imread("1626673361593_cropped_batch_code_image_imgGB3_BATOO69_.jpg")

to

image = [cv2.imread("1626673361593_cropped_batch_code_image_imgGB3_BATOO69_.jpg")]
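
The reason this change should resolve the "axes don't match array" error: transpose(0, 3, 1, 2) needs a 4-D array. With a single image, the list comprehension iterates over the rows of that image instead of over a list of images, so the stacked result is not a 4-D batch. Here is a NumPy-only illustration of the axis count, with a zero array standing in for the decoded image:

```python
import numpy as np

image = np.zeros((48, 96, 3), dtype=np.float32)  # stand-in for one HWC image

# A single HWC image has only three axes, so a four-axis transpose fails.
try:
    image.transpose(0, 3, 1, 2)
except ValueError as err:
    print("single image:", err)

# Wrapping the image in a list adds the batch axis: (1, 48, 96, 3).
batch = np.array([image], dtype=np.float32)
print(batch.transpose(0, 3, 1, 2).shape)  # (1, 3, 48, 96)
```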