YOLO v4 inference with TensorRT after training with TLT 3.0

Description

After training a YOLO v4 model with TLT 3.0 and exporting it, I am having issues performing inference with TensorRT.

The model has been successfully trained, validated and tested in TLT 3.0. I was able to export the model and deploy it within a DeepStream 5 application.

Now I also need to deploy the model using TensorRT with the Python API. The first thing I did was compile the TensorRT OSS plugins so that I could use the BatchedNMS plugin required by the model.
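
In case it is relevant, this is roughly how I make the rebuilt plugin library visible to the Python runtime before deserializing the engine. The library path is only an example and depends on where the OSS build is installed:

import ctypes
import tensorrt as trt

# Preload the rebuilt libnvinfer_plugin (TensorRT OSS build) so that the
# BatchedNMS_TRT plugin creator is registered before the engine is
# deserialized. The path below is just an example; adjust it to wherever
# your OSS build of the library is installed.
ctypes.CDLL("/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.7",
            mode=ctypes.RTLD_GLOBAL)

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")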

I wrote a piece of code to

  • deserialize the TRT engine file that was created by the DeepStream application from the etlt file exported from TLT
  • load and pre-process the data
  • copy the data to the GPU
  • run inference
  • copy the results back from the GPU

I managed to get the code to run without errors, but the output of the inference does not look right.

There are 4 outputs, as expected:

  • the number of detections (a single int)
  • the bounding box coordinates (array)
  • the score of each detection
  • the class label of each detection

But even when the number of detections is > 0, the bounding box coordinates come back as an array filled with zeros. The same goes for the scores and class labels: arrays containing only zeros.
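
To rule out a mix-up in how I map the bindings to these four outputs, I also print the binding names, shapes and dtypes after deserializing the engine (a small sanity-check sketch using the standard TensorRT Python API):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

# Print every binding's name, shape and dtype to check which output is
# which (number of detections, boxes, scores, class labels).
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    for i in range(engine.num_bindings):
        kind = "input " if engine.binding_is_input(i) else "output"
        print(kind, engine.get_binding_name(i),
              engine.get_binding_shape(i), engine.get_binding_dtype(i))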

In the PGIE configuration file used by the DeepStream app (which runs as expected) there are some properties related to the model input, such as the offsets, the colour format and the input dimensions:

[property]
...
offsets=103.939;116.779;123.68
net-scale-factor=1
#0=RGB, 1=BGR
model-color-format=1
infer-dims=3;768;1024
batch-size=1
...

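As far as I understand the nvinfer documentation, these properties amount to the per-pixel arithmetic y = net-scale-factor * (x - offset), applied channel-wise in the model-color-format order. A minimal sketch of my understanding (not DeepStream's actual code; the placeholder image array is hypothetical):

import numpy as np

# My reading of the nvinfer pre-processing (a sketch, not DeepStream's code):
# y = net-scale-factor * (x - offset), applied per channel, after the image
# has been converted to model-color-format (BGR here) and resized to infer-dims.
net_scale_factor = 1.0
offsets = np.array([103.939, 116.779, 123.68], dtype=np.float32)

img_chw = np.zeros((3, 768, 1024), dtype=np.float32)  # placeholder CHW BGR image
preprocessed = net_scale_factor * (img_chw - offsets[:, None, None])
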
I have applied these properties in the code I developed, as you can see in the snippets below:

import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER,'')
DTYPE_TRT = trt.float32
import pycuda.driver as cuda
import pycuda.autoinit
from PIL import Image
import numpy as np
path_img = "image.jpg"
offsets  = ( 103.939, 116.779, 123.68 )

# ...
# some helper functions from TRT sample Python code
# ...

def load_input(img_path, host_buffer):
    # convert to BGR and CHW format
    with Image.open(img_path) as img:
        # RGB to BGR
        r, g, b = img.split()              
        img = Image.merge('RGB', (b, g, r))

        c, h, w = (3, 768, 1024)
        dtype = trt.nptype(DTYPE_TRT) 
        img_res = img.resize((w, h), Image.BICUBIC)
        img_res = np.array(img_res, dtype=dtype, order='C')

        # HWC to CHW format:
        img_chw = np.transpose(img_res, [2, 0, 1])
       
        # Applying offsets to BGR channels
        img_chw[0] = img_chw[0] - offsets[0]
        img_chw[1] = img_chw[1] - offsets[1]
        img_chw[2] = img_chw[2] - offsets[2]

        img_array = img_chw.ravel()
        np.copyto(host_buffer, img_array)

# Inference
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    
    with engine.create_execution_context() as context:

        # allocate buffers
        inputs, outputs, bindings = allocate_buffers(engine)
        stream = cuda.Stream()

        # load image and pre-processing
        load_input(path_img, inputs[0].host)

        # transfer input data to the GPU.
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        
        # inference
        inference = context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
        
        # Synchronize the stream
        stream.synchronize()
        
        # Print the host output:
        print("OUTPUT")
        print(outputs)

A sample output of the code is given below:

OUTPUT
[Host:
[42]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e3a0>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e490>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e580>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e670>]

I have tested different images and always get the same type of output.

Is my issue related to the pre-processing of the data, or am I missing something else in the inference code?

Also, it is not very clear from the DeepStream documentation: do the offsets apply to the BGR channels or the RGB channels? I.e., should the first component of the offsets apply to the R channel or the B channel?

Environment

TensorRT Version: 7.1.3
GPU Type: Titan V
Nvidia Driver Version: 455.45.01
CUDA Version: 10.2
CUDNN Version: 8
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9
Baremetal or Container (if container which image + tag): Baremetal

Hi,
Can you try running your model with the trtexec command, and share the "--verbose" log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the supported operators list; in case any operator is not supported, you need to create a custom plugin to support that operation.

Also, we request you to share your model and script, if not already shared, so that we can help you better.

Thanks!

Hi NVES,

Here is the output from trtexec (it seems to be ok?):

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --batch=1 --verbose
[04/08/2021-21:39:13] [I] === Model Options ===
[04/08/2021-21:39:13] [I] Format: *
[04/08/2021-21:39:13] [I] Model: 
[04/08/2021-21:39:13] [I] Output:
[04/08/2021-21:39:13] [I] === Build Options ===
[04/08/2021-21:39:13] [I] Max batch: 1
[04/08/2021-21:39:13] [I] Workspace: 16 MB
[04/08/2021-21:39:13] [I] minTiming: 1
[04/08/2021-21:39:13] [I] avgTiming: 8
[04/08/2021-21:39:13] [I] Precision: FP32
[04/08/2021-21:39:13] [I] Calibration: 
[04/08/2021-21:39:13] [I] Safe mode: Disabled
[04/08/2021-21:39:13] [I] Save engine: 
[04/08/2021-21:39:13] [I] Load engine: model.engine
[04/08/2021-21:39:13] [I] Builder Cache: Enabled
[04/08/2021-21:39:13] [I] NVTX verbosity: 0
[04/08/2021-21:39:13] [I] Inputs format: fp32:CHW
[04/08/2021-21:39:13] [I] Outputs format: fp32:CHW
[04/08/2021-21:39:13] [I] Input build shapes: model
[04/08/2021-21:39:13] [I] Input calibration shapes: model
[04/08/2021-21:39:13] [I] === System Options ===
[04/08/2021-21:39:13] [I] Device: 0
[04/08/2021-21:39:13] [I] DLACore: 
[04/08/2021-21:39:13] [I] Plugins:
[04/08/2021-21:39:13] [I] === Inference Options ===
[04/08/2021-21:39:13] [I] Batch: 1
[04/08/2021-21:39:13] [I] Input inference shapes: model
[04/08/2021-21:39:13] [I] Iterations: 10
[04/08/2021-21:39:13] [I] Duration: 3s (+ 200ms warm up)
[04/08/2021-21:39:13] [I] Sleep time: 0ms
[04/08/2021-21:39:13] [I] Streams: 1
[04/08/2021-21:39:13] [I] ExposeDMA: Disabled
[04/08/2021-21:39:13] [I] Spin-wait: Disabled
[04/08/2021-21:39:13] [I] Multithreading: Disabled
[04/08/2021-21:39:13] [I] CUDA Graph: Disabled
[04/08/2021-21:39:13] [I] Skip inference: Disabled
[04/08/2021-21:39:13] [I] Inputs:
[04/08/2021-21:39:13] [I] === Reporting Options ===
[04/08/2021-21:39:13] [I] Verbose: Enabled
[04/08/2021-21:39:13] [I] Averages: 10 inferences
[04/08/2021-21:39:13] [I] Percentile: 99
[04/08/2021-21:39:13] [I] Dump output: Disabled
[04/08/2021-21:39:13] [I] Profile: Disabled
[04/08/2021-21:39:13] [I] Export timing to JSON file: 
[04/08/2021-21:39:13] [I] Export output to JSON file: 
[04/08/2021-21:39:13] [I] Export profile to JSON file: 
[04/08/2021-21:39:13] [I] 
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Proposal version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[04/08/2021-21:39:14] [V] [TRT] Deserialize required 855265 microseconds.
[04/08/2021-21:39:14] [I] Starting inference threads
[04/08/2021-21:39:17] [I] Warmup completed 20 queries over 200 ms
[04/08/2021-21:39:17] [I] Timing trace has 316 queries over 3.01907 s
[04/08/2021-21:39:17] [I] Trace averages of 10 runs:
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.60522 ms - Host latency: 11.1722 ms (end to end 18.4379 ms, enqueue 2.38806 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.60532 ms - Host latency: 11.1729 ms (end to end 18.4686 ms, enqueue 2.37536 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.51675 ms - Host latency: 11.0845 ms (end to end 18.7923 ms, enqueue 2.38184 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53139 ms - Host latency: 11.0985 ms (end to end 18.5644 ms, enqueue 2.37105 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52454 ms - Host latency: 11.0943 ms (end to end 18.7998 ms, enqueue 2.37631 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.54829 ms - Host latency: 11.1189 ms (end to end 18.608 ms, enqueue 2.39402 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.50683 ms - Host latency: 11.0739 ms (end to end 18.7834 ms, enqueue 2.37681 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52607 ms - Host latency: 11.0938 ms (end to end 18.7242 ms, enqueue 2.38451 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.5155 ms - Host latency: 11.0833 ms (end to end 18.7939 ms, enqueue 2.41888 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53549 ms - Host latency: 11.1026 ms (end to end 18.7437 ms, enqueue 2.38383 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.64969 ms - Host latency: 11.2172 ms (end to end 18.9746 ms, enqueue 2.41281 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.56407 ms - Host latency: 11.1328 ms (end to end 18.2136 ms, enqueue 2.38313 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.51993 ms - Host latency: 11.0866 ms (end to end 18.7584 ms, enqueue 2.38447 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53993 ms - Host latency: 11.1071 ms (end to end 18.7939 ms, enqueue 2.37832 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53353 ms - Host latency: 11.1016 ms (end to end 18.6047 ms, enqueue 2.375 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52291 ms - Host latency: 11.0902 ms (end to end 18.8013 ms, enqueue 2.3837 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53189 ms - Host latency: 11.1006 ms (end to end 18.6207 ms, enqueue 2.3755 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52517 ms - Host latency: 11.0925 ms (end to end 18.6595 ms, enqueue 2.37561 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53262 ms - Host latency: 11.1011 ms (end to end 18.5324 ms, enqueue 2.38126 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.5319 ms - Host latency: 11.0992 ms (end to end 18.8119 ms, enqueue 2.377 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52864 ms - Host latency: 11.0968 ms (end to end 18.7304 ms, enqueue 2.37883 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52002 ms - Host latency: 11.0883 ms (end to end 18.7924 ms, enqueue 2.37947 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53667 ms - Host latency: 11.1041 ms (end to end 18.6309 ms, enqueue 2.37495 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52307 ms - Host latency: 11.093 ms (end to end 18.7872 ms, enqueue 2.37766 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53164 ms - Host latency: 11.0994 ms (end to end 18.7299 ms, enqueue 2.38091 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52068 ms - Host latency: 11.0883 ms (end to end 18.7975 ms, enqueue 2.37576 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53113 ms - Host latency: 11.0993 ms (end to end 18.5183 ms, enqueue 2.37327 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52207 ms - Host latency: 11.0889 ms (end to end 18.7955 ms, enqueue 2.38027 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.58308 ms - Host latency: 11.1534 ms (end to end 18.0532 ms, enqueue 2.37944 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.50659 ms - Host latency: 11.0748 ms (end to end 18.7812 ms, enqueue 2.37905 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52559 ms - Host latency: 11.0926 ms (end to end 18.68 ms, enqueue 2.37747 ms)
[04/08/2021-21:39:17] [I] Host Latency
[04/08/2021-21:39:17] [I] min: 11.0056 ms (end to end 13.7264 ms)
[04/08/2021-21:39:17] [I] max: 12.3127 ms (end to end 20.0497 ms)
[04/08/2021-21:39:17] [I] mean: 11.1058 ms (end to end 18.672 ms)
[04/08/2021-21:39:17] [I] median: 11.0933 ms (end to end 18.7994 ms)
[04/08/2021-21:39:17] [I] percentile: 11.3425 ms at 99% (end to end 19.0951 ms at 99%)
[04/08/2021-21:39:17] [I] throughput: 104.668 qps
[04/08/2021-21:39:17] [I] walltime: 3.01907 s
[04/08/2021-21:39:17] [I] Enqueue Time
[04/08/2021-21:39:17] [I] min: 2.3114 ms
[04/08/2021-21:39:17] [I] max: 2.74396 ms
[04/08/2021-21:39:17] [I] median: 2.38004 ms
[04/08/2021-21:39:17] [I] GPU Compute
[04/08/2021-21:39:17] [I] min: 9.44214 ms
[04/08/2021-21:39:17] [I] max: 10.7449 ms
[04/08/2021-21:39:17] [I] mean: 9.53782 ms
[04/08/2021-21:39:17] [I] median: 9.52533 ms
[04/08/2021-21:39:17] [I] percentile: 9.77197 ms at 99%
[04/08/2021-21:39:17] [I] total compute time: 3.01395 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --batch=1 --verbose

I’ll be posting the whole script in a separate reply.

Here is the full script (it’s quite basic):

import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER,'')
DTYPE_TRT = trt.float32
import pycuda.driver as cuda
import pycuda.autoinit
from PIL import Image
import numpy as np

path_img = "image.jpg"
offsets  = ( 103.939, 116.779, 123.68 )
yolo_reso = (3, 768, 1024)

# Simple helper data class that's a little nicer to use than a 2-tuple
# from TRT Python sample code
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        #dtype = DTYPE_TRT
        print(dtype)
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings

def load_input(img_path, host_buffer):
    # convert to BGR and CHW format
    with Image.open(img_path) as img:
        # RGB to BGR
        r, g, b = img.split()              
        img = Image.merge('RGB', (b, g, r))

        c, h, w = yolo_reso
        dtype = trt.nptype(DTYPE_TRT) 
        img_res = img.resize((w, h), Image.BICUBIC)
        img_res = np.array(img_res, dtype=dtype, order='C')

        # HWC to CHW format:
        img_chw = np.transpose(img_res, [2, 0, 1])
       
        # Applying offsets to BGR channels
        img_chw[0] = img_chw[0] - offsets[0]
        img_chw[1] = img_chw[1] - offsets[1]
        img_chw[2] = img_chw[2] - offsets[2]

        img_array = img_chw.ravel()
        np.copyto(host_buffer, img_array)

# Inference
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    
    with engine.create_execution_context() as context:

        # allocate buffers
        inputs, outputs, bindings = allocate_buffers(engine)
        stream = cuda.Stream()

        # load image and pre-processing
        load_input(path_img, inputs[0].host)

        # transfer input data to the GPU.
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        
        # inference
        inference = context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
        
        # Synchronize the stream
        stream.synchronize()
        
        # Print the host output:
        print("OUTPUT")
        print(outputs)

You can download an etlt model file here: Dropbox - File Deleted. The model is trained to detect 15 classes.

I converted it using the tlt-converter utility (for CUDA 10.2, cuDNN 8 and TensorRT 7.1) with this command:

tlt-converter -k nvidia_tlt \
              -d 3,768,1024 \
              -o BatchedNMS \
              -e model_fp32.engine \
              -m 1 \
              -t fp32 \
              -i nchw \
              yolov4.etlt

If I use the etlt or TensorRT engine file in DeepStream, it works without any issue. But unfortunately I cannot use DeepStream for this work.

Any idea why I only get zeros? I have tried different types of pre-processing (with and without the offsets, BGR and RGB formats, dividing the pixel values by 255…) and I always get the same kind of output (only the number of detections varies).

Hi @johan_b,

We request you to post your concern on the TLT forum to get better help. You may get more details there on the post-processing part.

Thank you.

Hi spolisetty,

I just did: Inference with TensorRT after training Yolo v4 with TLT 3.0

Thanks,

Johan

Hi again,

I finally found the issue: I was not transferring all of the output data back from the GPU; only the first output buffer was being copied. Copying all four outputs back fixes it:

# Transfer predictions back from the GPU.
cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
cuda.memcpy_dtoh_async(outputs[1].host, outputs[1].device, stream)
cuda.memcpy_dtoh_async(outputs[2].host, outputs[2].device, stream)
cuda.memcpy_dtoh_async(outputs[3].host, outputs[3].device, stream)
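
For completeness, this is roughly how I unpack the four buffers afterwards (a sketch; it assumes the output order num detections / boxes / scores / classes and keepTopK = 200, which is what the array sizes in my engine suggest):

# Decode the BatchedNMS outputs (a sketch; assumes the binding order is
# num detections, boxes, scores, classes, and keepTopK = 200, which matches
# the sizes of the arrays I get back).
keep_top_k = 200
num_det = int(outputs[0].host[0])
boxes   = outputs[1].host.reshape(keep_top_k, 4)[:num_det]  # one (x1, y1, x2, y2) box per detection
scores  = outputs[2].host[:num_det]
classes = outputs[3].host[:num_det].astype(int)

for box, score, cls in zip(boxes, scores, classes):
    print(cls, score, box)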

Thanks,

Johan

Hi,
I’m getting the error below when I try to run the same code outside the TLT docker.

[TensorRT] INTERNAL ERROR: Assertion failed: d == a + length
/opt/tensorrt/TensorRT/plugin/batchedNMSPlugin/batchedNMSPlugin.cpp:70
Aborting…

Aborted (core dumped)

However, if I run the TensorRT engine inside the TLT 3.0 docker, I get the expected output. Could you please help me resolve this issue?