Inference with TensorRT after training YOLO v4 with TLT 3.0

Hi all,

After training a YOLO v4 model with TLT 3.0 and exporting it, I am having issues performing inference with TensorRT in Python.

The model has been successfully trained, validated and tested in TLT 3.0. I was able to export the model and deploy it within a DeepStream 5 application.

Now I also need to deploy the model using the TensorRT Python API. The first thing I did was compile the TensorRT OSS plugins to be able to use the BatchedNMS plugin required by the model.
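(For reference, one way to make the rebuilt plugin library visible from Python before deserializing the engine is to preload it. A minimal sketch; the .so path is a placeholder for whatever your OSS build produced:)

import ctypes
import tensorrt as trt

# Preload the OSS-built plugin library so BatchedNMS_TRT can be resolved
# when the engine is deserialized; RTLD_GLOBAL exposes its symbols to TensorRT.
ctypes.CDLL("/path/to/TensorRT/build/out/libnvinfer_plugin.so",
            mode=ctypes.RTLD_GLOBAL)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")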

I then converted the model with the tlt-converter utility (for CUDA 10.2, cuDNN 8 and TensorRT 7.1) using this command:

tlt-converter -k nvidia_tlt \
              -d 3,768,1024 \
              -o BatchedNMS \
              -e model_fp32.engine \
              -m 1 \
              -t fp32 \
              -i nchw \
              yolov4.etlt

(An etlt model file was originally shared via Dropbox, but the link has since been deleted. The model is trained to detect 15 classes.)

I wrote a piece of code to

  • deserialize the TRT engine file that was created by the DeepStream application from the etlt file exported from TLT
  • load and pre-process the data
  • copy the data to the GPU
  • perform the inference
  • get the data back from the GPU

I managed to get the code to run without errors, but the output of the inference does not seem right.

There are 4 outputs, as expected:

  • number of detections (single int)
  • the bounding box coordinates (array)
  • the scores of each object
  • the class labels.

But even when the number of detections is > 0, the bounding box coordinates output is an array filled with zeros. The same goes for the scores and class labels: arrays filled with only zeros.

In the PGIE configuration file used by the DeepStream app (which runs as expected) there are some properties related to the model input, such as the offsets, the colour format and the input dimensions:

[property]
...
offsets=103.939;116.779;123.68
net-scale-factor=1
#0=RGB, 1=BGR
model-color-format=1
infer-dims=3;768;1024
batch-size=1
num-detected-classes=15
...
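From my reading of the nvinfer documentation, these properties are applied per channel as y = net-scale-factor * (pixel - offset), after the frame has been converted to the configured colour format. Roughly, in NumPy (a sketch; variable names are mine):

import numpy as np

net_scale_factor = 1.0
offsets = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def normalize(frame_hwc):
    # y = net-scale-factor * (x - offset), applied per channel. I assume the
    # offsets follow the model colour format (BGR here, since
    # model-color-format=1); see my question about this further down.
    y = net_scale_factor * (frame_hwc.astype(np.float32) - offsets)
    return y.transpose(2, 0, 1)  # HWC -> CHW, matching infer-dims=3;768;1024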

I have applied those in the code I developed, as you can see in the snippet below:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
DTYPE_TRT = trt.float32

path_img = "image.jpg"
offsets  = ( 103.939, 116.779, 123.68 )
yolo_reso = (3, 768, 1024)

# Simple helper data class that's a little nicer to use than a 2-tuple
# from TRT Python sample code
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        print(dtype)  # debug: inspect each binding's dtype
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings

def load_input(img_path, host_buffer):
    # convert to BGR and CHW format
    with Image.open(img_path) as img:
        # RGB to BGR
        r, g, b = img.split()              
        img = Image.merge('RGB', (b, g, r))

        c, h, w = yolo_reso
        dtype = trt.nptype(DTYPE_TRT) 
        img_res = img.resize((w, h), Image.BICUBIC)
        img_res = np.array(img_res, dtype=dtype, order='C')

        # HWC to CHW format:
        img_chw = np.transpose(img_res, [2, 0, 1])
       
        # Applying offsets to BGR channels
        img_chw[0] = img_chw[0] - offsets[0]
        img_chw[1] = img_chw[1] - offsets[1]
        img_chw[2] = img_chw[2] - offsets[2]

        img_array = img_chw.ravel()
        np.copyto(host_buffer, img_array)

# Inference
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    
    with engine.create_execution_context() as context:

        # allocate buffers
        inputs, outputs, bindings = allocate_buffers(engine)
        stream = cuda.Stream()

        # load image and pre-processing
        load_input(path_img, inputs[0].host)

        # transfer input data to the GPU.
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        
        # inference (execute_async returns True on success)
        context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
        
        # Synchronize the stream
        stream.synchronize()
        
        # Print the host output:
        print("OUTPUT")
        print(outputs)

A sample output of the code is given below:

OUTPUT
[Host:
[42]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e3a0>, Host:
[0. 0. 0. ... 0. 0. 0.]  (800 zeros, truncated here for readability)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e490>, Host:
[0. 0. 0. ... 0. 0. 0.]  (200 zeros, truncated)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e580>, Host:
[0. 0. 0. ... 0. 0. 0.]  (200 zeros, truncated)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e670>]

I have tested different images and always get the same type of output. Any idea why I only get zeros (and why the sizes of the 3 arrays are always the same)?

I am wondering if my issue is related to the pre-processing of the data, or whether I am missing something else in the inference code.

I tried different types of pre-processing (with and without the offsets, BGR and RGB format, dividing the pixel values by 255…) and I always get the same type of output (only the number of detections varies).

Also, it is not very clear from the DeepStream documentation: do the offsets apply to the BGR channels or the RGB channels? I.e. should the first component of the offsets apply to the R channel or the B channel of the input image?

Please note that when I use the etlt or TensorRT engine file in DeepStream, it works without any issue. But unfortunately I cannot use DeepStream for this work.

Thanks,

Johan

(PS: I have already asked that question in the TensorRT forum, but was redirected here - YOLO v4 inference with TensorRT after training with TLT 3.0 - #4 by johan_b)

Environment

TensorRT Version: 7.1.3
GPU Type: Titan V
Nvidia Driver Version: 455.45.01
CUDA Version: 10.2
CUDNN Version: 8
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9
Baremetal or Container (if container which image + tag): Baremetal

Please refer to Inferring detectnet_v2 .trt model in python - #46 by Morganh and Inferring Yolo_v3.trt model in python

BTW, model-color-format should be “0” for an RGB configuration and “1” for BGR.

#0=RGB, 1=BGR
model-color-format=1

Hi Morganh,

I am not too sure what you mean by the model-color-format. I understand 0 is for RGB and 1 for BGR. It is set to 1 in my DeepStream app PGIE config file (that’s the value given in the sample DeepStream deployment file for YOLO v4).

The model works great in DeepStream, so that’s not really the issue here.

But unfortunately I cannot use DeepStream for the piece of work I’m doing.

I have tried the TensorRT engines generated by DeepStream and by tlt-converter from the same etlt file exported by TLT. I still have the same issue of outputs filled with zeros.

I checked the data types for the bindings (as you suggested here: Inferring detectnet_v2 .trt model in python - #46 by Morganh):

Input
<class 'numpy.float32'>
BatchedNMS
<class 'numpy.int32'>
BatchedNMS_1
<class 'numpy.float32'>
BatchedNMS_2
<class 'numpy.float32'>
BatchedNMS_3
<class 'numpy.float32'>
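(For reference, I printed this with a small loop over the engine bindings, something like:)

for binding in engine:
    print(binding)
    print(trt.nptype(engine.get_binding_dtype(binding)))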

I also went through the other post you linked.

To make sure it was not an RGB → BGR conversion issue with PIL, I switched to OpenCV to read the image.

The offsets are applied to their respective channels as well.

To make sure it is not an issue with resizing (i.e. not preserving the aspect ratio), I used a 1024x768 input image.
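For reference, the OpenCV variant of my pre-processing looks roughly like this (a sketch, with host_buffer being inputs[0].host as before; cv2.imread already returns the image in BGR order, so no channel swap is needed):

import cv2
import numpy as np

img = cv2.imread(path_img)  # OpenCV loads images as BGR, HWC, uint8
img = cv2.resize(img, (1024, 768), interpolation=cv2.INTER_CUBIC)
img = img.astype(np.float32) - np.array(offsets, dtype=np.float32)  # per-channel offsets
img = img.transpose(2, 0, 1)  # HWC -> CHW
np.copyto(host_buffer, img.ravel())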

In TLT, I tried the export to an etlt again and then converted it to a TRT engine. After that I tested the engine using tlt yolo_v4 inference and it works as expected. When using the etlt file in a DeepStream app, it works as well.

But when I try to use the TensorRT engine in a Python application, it only outputs a bunch of zeros for the last 3 outputs (bounding box coordinates, classes and probabilities).

Here is the output I currently get on a sample image:

OUTPUT
[Host:
[10]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f79d61d5210>, Host:
[0. 0. 0. ... 0. 0. 0.]  (800 zeros, truncated here for readability)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f79d61d5300>, Host:
[0. 0. 0. ... 0. 0. 0.]  (200 zeros, truncated)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f79d61d53f0>, Host:
[0. 0. 0. ... 0. 0. 0.]  (200 zeros, truncated)

I don’t know what else I can try to solve that issue.

I have put the code, a sample image and the etlt file here:

The etlt can be converted using:

./tlt-converter -k nvidia_tlt \
                -d 3,768,1024 \
                -o BatchedNMS \
                -e trt.engine \
                -m 1 \
                -t fp32 \
                -i nchw \
                yolov4.etlt

I have also tested the model from GitHub - NVIDIA-AI-IOT/deepstream_tao_apps: Sample apps to demonstrate how to deploy models trained with TAO on DeepStream, but I still have the same issue.

Thanks,

Johan

As I mentioned, please go through Inferring Yolo_v3.trt model in python. You can refer to the other user’s code and also pay attention to my comments on it.
Inferring Yolo_v3.trt model in python - #26 by Morganh

Hi Morganh,

I’ve done that already. I have looked at the other user’s code, adapted mine, tried many different things… but still no luck.

I tried to simplify my code as much as possible to find the bug, and I am not even looking at post-processing the outputs of the model yet.

Even if the pre-processing were off, the model should spit out some random predictions. So far, nothing is being returned, only zeros, except for the first output giving the number of detected objects. So the model seems to detect objects, but I cannot get the coordinates, classes and scores from the engine.

I ran the model with trtexec using random inputs, and it generates outputs that are not all zeros.

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.etlt_b1_gpu0_fp32.engine --batch=1 --verbose --dumpOutput
[04/10/2021-16:07:04] [I] === Model Options ===
[04/10/2021-16:07:04] [I] Format: *
[04/10/2021-16:07:04] [I] Model: 
[04/10/2021-16:07:04] [I] Output:
[04/10/2021-16:07:04] [I] === Build Options ===
[04/10/2021-16:07:04] [I] Max batch: 1
[04/10/2021-16:07:04] [I] Workspace: 16 MB
[04/10/2021-16:07:04] [I] minTiming: 1
[04/10/2021-16:07:04] [I] avgTiming: 8
[04/10/2021-16:07:04] [I] Precision: FP32
[04/10/2021-16:07:04] [I] Calibration: 
[04/10/2021-16:07:04] [I] Safe mode: Disabled
[04/10/2021-16:07:04] [I] Save engine: 
[04/10/2021-16:07:04] [I] Load engine: model.etlt_b1_gpu0_fp32.engine
[04/10/2021-16:07:04] [I] Builder Cache: Enabled
[04/10/2021-16:07:04] [I] NVTX verbosity: 0
[04/10/2021-16:07:04] [I] Inputs format: fp32:CHW
[04/10/2021-16:07:04] [I] Outputs format: fp32:CHW
[04/10/2021-16:07:04] [I] Input build shapes: model
[04/10/2021-16:07:04] [I] Input calibration shapes: model
[04/10/2021-16:07:04] [I] === System Options ===
[04/10/2021-16:07:04] [I] Device: 0
[04/10/2021-16:07:04] [I] DLACore: 
[04/10/2021-16:07:04] [I] Plugins:
[04/10/2021-16:07:04] [I] === Inference Options ===
[04/10/2021-16:07:04] [I] Batch: 1
[04/10/2021-16:07:04] [I] Input inference shapes: model
[04/10/2021-16:07:04] [I] Iterations: 10
[04/10/2021-16:07:04] [I] Duration: 3s (+ 200ms warm up)
[04/10/2021-16:07:04] [I] Sleep time: 0ms
[04/10/2021-16:07:04] [I] Streams: 1
[04/10/2021-16:07:04] [I] ExposeDMA: Disabled
[04/10/2021-16:07:04] [I] Spin-wait: Disabled
[04/10/2021-16:07:04] [I] Multithreading: Disabled
[04/10/2021-16:07:04] [I] CUDA Graph: Disabled
[04/10/2021-16:07:04] [I] Skip inference: Disabled
[04/10/2021-16:07:04] [I] Inputs:
[04/10/2021-16:07:04] [I] === Reporting Options ===
[04/10/2021-16:07:04] [I] Verbose: Enabled
[04/10/2021-16:07:04] [I] Averages: 10 inferences
[04/10/2021-16:07:04] [I] Percentile: 99
[04/10/2021-16:07:04] [I] Dump output: Enabled
[04/10/2021-16:07:04] [I] Profile: Disabled
[04/10/2021-16:07:04] [I] Export timing to JSON file: 
[04/10/2021-16:07:04] [I] Export output to JSON file: 
[04/10/2021-16:07:04] [I] Export profile to JSON file: 
[04/10/2021-16:07:04] [I] 
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::Proposal version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[04/10/2021-16:07:04] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[04/10/2021-16:07:07] [V] [TRT] Deserialize required 2134079 microseconds.
[04/10/2021-16:07:07] [I] Starting inference threads
[04/10/2021-16:07:10] [I] Warmup completed 10 queries over 200 ms
[04/10/2021-16:07:10] [I] Timing trace has 143 queries over 3.05057 s
[04/10/2021-16:07:10] [I] Trace averages of 10 runs:
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1925 ms - Host latency: 22.7567 ms (end to end 42.1527 ms, enqueue 2.09218 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1672 ms - Host latency: 22.7313 ms (end to end 41.3656 ms, enqueue 2.08724 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1753 ms - Host latency: 22.7399 ms (end to end 41.8365 ms, enqueue 2.09033 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1688 ms - Host latency: 22.7332 ms (end to end 41.7161 ms, enqueue 2.10307 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1861 ms - Host latency: 22.7493 ms (end to end 41.2734 ms, enqueue 2.09435 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1469 ms - Host latency: 22.7116 ms (end to end 42.0307 ms, enqueue 2.09911 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1796 ms - Host latency: 22.7439 ms (end to end 41.1115 ms, enqueue 2.14235 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1604 ms - Host latency: 22.7264 ms (end to end 42.0915 ms, enqueue 2.08475 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.2231 ms - Host latency: 22.786 ms (end to end 39.8281 ms, enqueue 2.06548 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1515 ms - Host latency: 22.7164 ms (end to end 41.9878 ms, enqueue 2.08599 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1398 ms - Host latency: 22.7041 ms (end to end 41.3046 ms, enqueue 2.09263 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1191 ms - Host latency: 22.684 ms (end to end 41.9827 ms, enqueue 2.09614 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1869 ms - Host latency: 22.7502 ms (end to end 39.6555 ms, enqueue 2.07324 ms)
[04/10/2021-16:07:10] [I] Average on 10 runs - GPU latency: 21.1604 ms - Host latency: 22.7251 ms (end to end 41.7611 ms, enqueue 2.08916 ms)
[04/10/2021-16:07:10] [I] Host Latency
[04/10/2021-16:07:10] [I] min: 22.6125 ms (end to end 22.7795 ms)
[04/10/2021-16:07:10] [I] max: 23.1746 ms (end to end 42.5667 ms)
[04/10/2021-16:07:10] [I] mean: 22.7323 ms (end to end 41.4483 ms)
[04/10/2021-16:07:10] [I] median: 22.7281 ms (end to end 42.0558 ms)
[04/10/2021-16:07:10] [I] percentile: 22.8926 ms at 99% (end to end 42.3173 ms at 99%)
[04/10/2021-16:07:10] [I] throughput: 46.8765 qps
[04/10/2021-16:07:10] [I] walltime: 3.05057 s
[04/10/2021-16:07:10] [I] Enqueue Time
[04/10/2021-16:07:10] [I] min: 1.94055 ms
[04/10/2021-16:07:10] [I] max: 2.3136 ms
[04/10/2021-16:07:10] [I] median: 2.09122 ms
[04/10/2021-16:07:10] [I] GPU Compute
[04/10/2021-16:07:10] [I] min: 21.0483 ms
[04/10/2021-16:07:10] [I] max: 21.6146 ms
[04/10/2021-16:07:10] [I] mean: 21.168 ms
[04/10/2021-16:07:10] [I] median: 21.168 ms
[04/10/2021-16:07:10] [I] percentile: 21.3279 ms at 99%
[04/10/2021-16:07:10] [I] total compute time: 3.02702 s
[04/10/2021-16:07:10] [I] Output Tensors:
[04/10/2021-16:07:10] [I] BatchedNMS_3: (200)
[04/10/2021-16:07:10] [I] 11 13 0 10 8 1 0 0 13 4 0 11 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
[04/10/2021-16:07:10] [I] BatchedNMS_2: (200)
[04/10/2021-16:07:10] [I] 0.00680159 0.00605062 0.00541481 0.00515901 0.00274406 0.00217285 0.00159991 0.00156826 0.00113638 0.0010739 0.00106076 0.00102169 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[04/10/2021-16:07:10] [I] BatchedNMS_1: (200x4)
[04/10/2021-16:07:10] [I] 0.815778 0 1 0.434126 0.775406 0 1 0.453924 0.821989 0 1 0.480945 0.775406 0 1 0.453924 0.816268 0 1 0.544159 0.815778 0 1 0.434126 0.914096 0 1 0.587802 0.988862 0.0105914 1 0.415972 0.601418 0 0.99126 0.464936 0.815778 0 1 0.434126 0 0 0.0147054 0.279072 0.742937 0 1 0.571718 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[04/10/2021-16:07:10] [I] BatchedNMS: ()
[04/10/2021-16:07:10] [I] 12
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.etlt_b1_gpu0_fp32.engine --batch=1 --verbose --dumpOutput

Thanks,

Johan

I finally found the issue.

I did not copy all the data back from the GPU.

It should be:

# Transfer predictions back from the GPU.
cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
cuda.memcpy_dtoh_async(outputs[1].host, outputs[1].device, stream)
cuda.memcpy_dtoh_async(outputs[2].host, outputs[2].device, stream)
cuda.memcpy_dtoh_async(outputs[3].host, outputs[3].device, stream)
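More generally, the copy-back can loop over all outputs, and the four host buffers can then be unpacked, using the same variables as the script above (a sketch; I assume the boxes are normalized [x1, y1, x2, y2], which matches the trtexec dump above):

# Copy every output back from the GPU, then wait for the stream to finish.
for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
stream.synchronize()

# Unpack the BatchedNMS outputs: keep count, boxes, scores, class ids.
keep_count = int(outputs[0].host[0])
boxes = outputs[1].host.reshape(-1, 4)[:keep_count]  # normalized x1, y1, x2, y2 (assumed)
scores = outputs[2].host[:keep_count]
classes = outputs[3].host[:keep_count].astype(np.int32)

c, h, w = yolo_reso
for (x1, y1, x2, y2), score, cls in zip(boxes, scores, classes):
    print(int(cls), float(score), x1 * w, y1 * h, x2 * w, y2 * h)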

So problem solved :)

Thanks,

Johan
