YOLO v4 inference with TensorRT after training with TLT 3.0

Description

After training a YOLO v4 model with TLT 3.0 and exporting it, I am having issues performing inference using TensorRT.

The model has been successfully trained, validated and tested in TLT 3.0. I was able to export the model and deploy it within a DeepStream 5 application.

Now I also need to deploy the model using TensorRT with the Python API. The first thing I did was build the TensorRT OSS plugins so I could use the BatchedNMS plugin the model requires.

I wrote a piece of code to

  • deserialize the TRT engine file created by the DeepStream application from the etlt file exported from TLT
  • load and pre-process the data
  • copy the data to the GPU
  • perform the inference
  • get the data back from the GPU

I managed to get the code to run without errors, but the output of the inference does not look right.

There are 4 outputs, as expected:

  • number of detections (a single int)
  • the bounding box coordinates (array)
  • the score of each object
  • the class label of each object

But even when the number of detections is > 0, the bounding box coordinates come back as an array filled with zeros. The same goes for the scores and class labels: arrays of only zeros.
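
For reference, the four BatchedNMS outputs come back as flat host buffers. Here is a small sketch of how I would expect to unpack them per detection (the values and the top-K of 2 are made up for illustration; the buffer names are mine, not the plugin's):

```python
import numpy as np

# Synthetic host buffers standing in for the four BatchedNMS outputs
# (keep count, boxes, scores, classes); values here are made up.
max_dets = 2
num_detections = np.array([1], dtype=np.int32)
nmsed_boxes = np.zeros(max_dets * 4, dtype=np.float32)
nmsed_boxes[:4] = [0.1, 0.2, 0.5, 0.6]          # x1, y1, x2, y2
nmsed_scores = np.array([0.9, 0.0], dtype=np.float32)
nmsed_classes = np.array([3.0, 0.0], dtype=np.float32)

# Keep only the first num_detections entries of each buffer.
n = int(num_detections[0])
boxes = nmsed_boxes.reshape(max_dets, 4)[:n]
scores = nmsed_scores[:n]
classes = nmsed_classes.astype(np.int32)[:n]
for b, s, c in zip(boxes, scores, classes):
    print(f"class={c} score={s:.2f} box={np.round(b, 3).tolist()}")
```

In my case the buffers after the first one are all zeros even when `num_detections` is positive.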

In the PGIE configuration file used by the DeepStream app (which runs as expected), there are some properties related to the model input, such as the offsets, the colour format and the input dimensions:

[property]
...
offsets=103.939;116.779;123.68
net-scale-factor=1
#0=RGB, 1=BGR
model-color-format=1
infer-dims=3;768;1024
batch-size=1
...
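
My understanding of these properties (a hedged sketch, assuming DeepStream's documented per-channel formula y = net-scale-factor * (x - offset), applied after conversion to the model colour format; the function name is mine) in NumPy terms:

```python
import numpy as np

# Offsets and scale factor taken from the PGIE config above.
offsets = np.array([103.939, 116.779, 123.68], dtype=np.float32)
net_scale_factor = 1.0

def preprocess(img_hwc_rgb):
    """img_hwc_rgb: uint8 H x W x 3 RGB array, already resized to 768x1024."""
    bgr = img_hwc_rgb[..., ::-1].astype(np.float32)  # RGB -> BGR (model-color-format=1)
    chw = np.transpose(bgr, (2, 0, 1))               # HWC -> CHW (infer-dims=3;768;1024)
    chw -= offsets[:, None, None]                    # per-channel offsets
    return net_scale_factor * chw                    # net-scale-factor=1

dummy = np.zeros((768, 1024, 3), dtype=np.uint8)
out = preprocess(dummy)
print(out.shape)  # (3, 768, 1024)
```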

I have applied those in the code I developed, as you can see in the snippets below:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
DTYPE_TRT = trt.float32

path_img = "image.jpg"
offsets = (103.939, 116.779, 123.68)

# ...
# some helper functions from TRT sample Python code
# ...

def load_input(img_path, host_buffer):
    # convert to BGR and CHW format
    with Image.open(img_path) as img:
        # RGB to BGR
        r, g, b = img.split()              
        img = Image.merge('RGB', (b, g, r))

        c, h, w = (3, 768, 1024)
        dtype = trt.nptype(DTYPE_TRT) 
        img_res = img.resize((w, h), Image.BICUBIC)
        img_res = np.array(img_res, dtype=dtype, order='C')

        # HWC to CHW format:
        img_chw = np.transpose(img_res, [2, 0, 1])
       
        # Applying offsets to BGR channels
        img_chw[0] = img_chw[0] - offsets[0]
        img_chw[1] = img_chw[1] - offsets[1]
        img_chw[2] = img_chw[2] - offsets[2]

        img_array = img_chw.ravel()
        np.copyto(host_buffer, img_array)

# Inference
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    
    with engine.create_execution_context() as context:

        # allocate buffers
        inputs, outputs, bindings = allocate_buffers(engine)
        stream = cuda.Stream()

        # load image and pre-processing
        load_input(path_img, inputs[0].host)

        # transfer input data to the GPU.
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        
        # inference
        inference = context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
        
        # Synchronize the stream
        stream.synchronize()
        
        # Print the host output:
        print("OUTPUT")
        print(outputs)

A sample output of the code is given below:

OUTPUT
[Host:
[42]
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e3a0>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. ... 0. 0. 0. 0. 0. 0. 0. 0.]  (array of 800 zeros)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e490>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. ... 0. 0. 0. 0. 0. 0. 0. 0.]  (array of 200 zeros)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e580>, Host:
[0. 0. 0. 0. 0. 0. 0. 0. ... 0. 0. 0. 0. 0. 0. 0. 0.]  (array of 200 zeros)
Device:
<pycuda._driver.DeviceAllocation object at 0x7f707366e670>]

I have tested different images and always get the same type of output.

I am wondering whether my issue is related to the pre-processing of the data, or whether I am missing something else in the inference code.

Also, it is not very clear from the DeepStream documentation: do the offsets apply to the BGR channels or the RGB channels? I.e. should the first component of the offsets apply to the R channel or the B channel?

Environment

TensorRT Version: 7.1.3
GPU Type: Titan V
Nvidia Driver Version: 455.45.01
CUDA Version: 10.2
CUDNN Version: 8
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9
Baremetal or Container (if container which image + tag): Baremetal

Hi,
Can you try running your model with the trtexec command, and share the --verbose log in case the issue persists?

You can refer to the link below for the list of supported operators. In case any operator is not supported, you need to create a custom plugin to support that operation.

Also, we request you to share your model and script, if not shared already, so that we can help you better.

Thanks!

Hi NVES,

Here is the output from trtexec (it seems to be ok?):

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --batch=1 --verbose
[04/08/2021-21:39:13] [I] === Model Options ===
[04/08/2021-21:39:13] [I] Format: *
[04/08/2021-21:39:13] [I] Model: 
[04/08/2021-21:39:13] [I] Output:
[04/08/2021-21:39:13] [I] === Build Options ===
[04/08/2021-21:39:13] [I] Max batch: 1
[04/08/2021-21:39:13] [I] Workspace: 16 MB
[04/08/2021-21:39:13] [I] minTiming: 1
[04/08/2021-21:39:13] [I] avgTiming: 8
[04/08/2021-21:39:13] [I] Precision: FP32
[04/08/2021-21:39:13] [I] Calibration: 
[04/08/2021-21:39:13] [I] Safe mode: Disabled
[04/08/2021-21:39:13] [I] Save engine: 
[04/08/2021-21:39:13] [I] Load engine: model.engine
[04/08/2021-21:39:13] [I] Builder Cache: Enabled
[04/08/2021-21:39:13] [I] NVTX verbosity: 0
[04/08/2021-21:39:13] [I] Inputs format: fp32:CHW
[04/08/2021-21:39:13] [I] Outputs format: fp32:CHW
[04/08/2021-21:39:13] [I] Input build shapes: model
[04/08/2021-21:39:13] [I] Input calibration shapes: model
[04/08/2021-21:39:13] [I] === System Options ===
[04/08/2021-21:39:13] [I] Device: 0
[04/08/2021-21:39:13] [I] DLACore: 
[04/08/2021-21:39:13] [I] Plugins:
[04/08/2021-21:39:13] [I] === Inference Options ===
[04/08/2021-21:39:13] [I] Batch: 1
[04/08/2021-21:39:13] [I] Input inference shapes: model
[04/08/2021-21:39:13] [I] Iterations: 10
[04/08/2021-21:39:13] [I] Duration: 3s (+ 200ms warm up)
[04/08/2021-21:39:13] [I] Sleep time: 0ms
[04/08/2021-21:39:13] [I] Streams: 1
[04/08/2021-21:39:13] [I] ExposeDMA: Disabled
[04/08/2021-21:39:13] [I] Spin-wait: Disabled
[04/08/2021-21:39:13] [I] Multithreading: Disabled
[04/08/2021-21:39:13] [I] CUDA Graph: Disabled
[04/08/2021-21:39:13] [I] Skip inference: Disabled
[04/08/2021-21:39:13] [I] Inputs:
[04/08/2021-21:39:13] [I] === Reporting Options ===
[04/08/2021-21:39:13] [I] Verbose: Enabled
[04/08/2021-21:39:13] [I] Averages: 10 inferences
[04/08/2021-21:39:13] [I] Percentile: 99
[04/08/2021-21:39:13] [I] Dump output: Disabled
[04/08/2021-21:39:13] [I] Profile: Disabled
[04/08/2021-21:39:13] [I] Export timing to JSON file: 
[04/08/2021-21:39:13] [I] Export output to JSON file: 
[04/08/2021-21:39:13] [I] Export profile to JSON file: 
[04/08/2021-21:39:13] [I] 
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Proposal version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[04/08/2021-21:39:13] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[04/08/2021-21:39:14] [V] [TRT] Deserialize required 855265 microseconds.
[04/08/2021-21:39:14] [I] Starting inference threads
[04/08/2021-21:39:17] [I] Warmup completed 20 queries over 200 ms
[04/08/2021-21:39:17] [I] Timing trace has 316 queries over 3.01907 s
[04/08/2021-21:39:17] [I] Trace averages of 10 runs:
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.60522 ms - Host latency: 11.1722 ms (end to end 18.4379 ms, enqueue 2.38806 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.60532 ms - Host latency: 11.1729 ms (end to end 18.4686 ms, enqueue 2.37536 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.51675 ms - Host latency: 11.0845 ms (end to end 18.7923 ms, enqueue 2.38184 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53139 ms - Host latency: 11.0985 ms (end to end 18.5644 ms, enqueue 2.37105 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52454 ms - Host latency: 11.0943 ms (end to end 18.7998 ms, enqueue 2.37631 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.54829 ms - Host latency: 11.1189 ms (end to end 18.608 ms, enqueue 2.39402 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.50683 ms - Host latency: 11.0739 ms (end to end 18.7834 ms, enqueue 2.37681 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52607 ms - Host latency: 11.0938 ms (end to end 18.7242 ms, enqueue 2.38451 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.5155 ms - Host latency: 11.0833 ms (end to end 18.7939 ms, enqueue 2.41888 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53549 ms - Host latency: 11.1026 ms (end to end 18.7437 ms, enqueue 2.38383 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.64969 ms - Host latency: 11.2172 ms (end to end 18.9746 ms, enqueue 2.41281 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.56407 ms - Host latency: 11.1328 ms (end to end 18.2136 ms, enqueue 2.38313 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.51993 ms - Host latency: 11.0866 ms (end to end 18.7584 ms, enqueue 2.38447 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53993 ms - Host latency: 11.1071 ms (end to end 18.7939 ms, enqueue 2.37832 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53353 ms - Host latency: 11.1016 ms (end to end 18.6047 ms, enqueue 2.375 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52291 ms - Host latency: 11.0902 ms (end to end 18.8013 ms, enqueue 2.3837 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53189 ms - Host latency: 11.1006 ms (end to end 18.6207 ms, enqueue 2.3755 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52517 ms - Host latency: 11.0925 ms (end to end 18.6595 ms, enqueue 2.37561 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53262 ms - Host latency: 11.1011 ms (end to end 18.5324 ms, enqueue 2.38126 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.5319 ms - Host latency: 11.0992 ms (end to end 18.8119 ms, enqueue 2.377 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52864 ms - Host latency: 11.0968 ms (end to end 18.7304 ms, enqueue 2.37883 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52002 ms - Host latency: 11.0883 ms (end to end 18.7924 ms, enqueue 2.37947 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53667 ms - Host latency: 11.1041 ms (end to end 18.6309 ms, enqueue 2.37495 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52307 ms - Host latency: 11.093 ms (end to end 18.7872 ms, enqueue 2.37766 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53164 ms - Host latency: 11.0994 ms (end to end 18.7299 ms, enqueue 2.38091 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52068 ms - Host latency: 11.0883 ms (end to end 18.7975 ms, enqueue 2.37576 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.53113 ms - Host latency: 11.0993 ms (end to end 18.5183 ms, enqueue 2.37327 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52207 ms - Host latency: 11.0889 ms (end to end 18.7955 ms, enqueue 2.38027 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.58308 ms - Host latency: 11.1534 ms (end to end 18.0532 ms, enqueue 2.37944 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.50659 ms - Host latency: 11.0748 ms (end to end 18.7812 ms, enqueue 2.37905 ms)
[04/08/2021-21:39:17] [I] Average on 10 runs - GPU latency: 9.52559 ms - Host latency: 11.0926 ms (end to end 18.68 ms, enqueue 2.37747 ms)
[04/08/2021-21:39:17] [I] Host Latency
[04/08/2021-21:39:17] [I] min: 11.0056 ms (end to end 13.7264 ms)
[04/08/2021-21:39:17] [I] max: 12.3127 ms (end to end 20.0497 ms)
[04/08/2021-21:39:17] [I] mean: 11.1058 ms (end to end 18.672 ms)
[04/08/2021-21:39:17] [I] median: 11.0933 ms (end to end 18.7994 ms)
[04/08/2021-21:39:17] [I] percentile: 11.3425 ms at 99% (end to end 19.0951 ms at 99%)
[04/08/2021-21:39:17] [I] throughput: 104.668 qps
[04/08/2021-21:39:17] [I] walltime: 3.01907 s
[04/08/2021-21:39:17] [I] Enqueue Time
[04/08/2021-21:39:17] [I] min: 2.3114 ms
[04/08/2021-21:39:17] [I] max: 2.74396 ms
[04/08/2021-21:39:17] [I] median: 2.38004 ms
[04/08/2021-21:39:17] [I] GPU Compute
[04/08/2021-21:39:17] [I] min: 9.44214 ms
[04/08/2021-21:39:17] [I] max: 10.7449 ms
[04/08/2021-21:39:17] [I] mean: 9.53782 ms
[04/08/2021-21:39:17] [I] median: 9.52533 ms
[04/08/2021-21:39:17] [I] percentile: 9.77197 ms at 99%
[04/08/2021-21:39:17] [I] total compute time: 3.01395 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=model.engine --batch=1 --verbose

I’ll be posting the whole script in a separate reply.

Here is the full script (it’s quite basic):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(TRT_LOGGER, '')
DTYPE_TRT = trt.float32

path_img = "image.jpg"
offsets = (103.939, 116.779, 123.68)
yolo_reso = (3, 768, 1024)

# Simple helper data class that's a little nicer to use than a 2-tuple
# from TRT Python sample code
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def allocate_buffers(engine):
    inputs = []
    outputs = []
    bindings = []
    
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        #dtype = DTYPE_TRT
        print(dtype)
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings

def load_input(img_path, host_buffer):
    # convert to BGR and CHW format
    with Image.open(img_path) as img:
        # RGB to BGR
        r, g, b = img.split()              
        img = Image.merge('RGB', (b, g, r))

        c, h, w = yolo_reso
        dtype = trt.nptype(DTYPE_TRT) 
        img_res = img.resize((w, h), Image.BICUBIC)
        img_res = np.array(img_res, dtype=dtype, order='C')

        # HWC to CHW format:
        img_chw = np.transpose(img_res, [2, 0, 1])
       
        # Applying offsets to BGR channels
        img_chw[0] = img_chw[0] - offsets[0]
        img_chw[1] = img_chw[1] - offsets[1]
        img_chw[2] = img_chw[2] - offsets[2]

        img_array = img_chw.ravel()
        np.copyto(host_buffer, img_array)

# Inference
with open("model_fp32.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
    
    with engine.create_execution_context() as context:

        # allocate buffers
        inputs, outputs, bindings = allocate_buffers(engine)
        stream = cuda.Stream()

        # load image and pre-processing
        load_input(path_img, inputs[0].host)

        # transfer input data to the GPU.
        cuda.memcpy_htod_async(inputs[0].device, inputs[0].host, stream)
        
        # inference
        inference = context.execute_async(batch_size=1, bindings=bindings, stream_handle=stream.handle)
        
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
        
        # Synchronize the stream
        stream.synchronize()
        
        # Print the host output:
        print("OUTPUT")
        print(outputs)

You can download an etlt model file here: Dropbox - yolov4.etlt. The model is trained to detect 15 classes.

I converted it using the tlt-converter utility (for CUDA 10.2, cuDNN 8 and TRT 7.1) with this command:

tlt-converter -k nvidia_tlt \
                   -d 3,768,1024 \
                   -o BatchedNMS \
                   -e model_fp32.engine \
                   -m 1 \
                   -t fp32 \
                   -i nchw \
                    yolov4.etlt

If I use the etlt or TensorRT engine file in DeepStream, it works without any issue. But unfortunately I cannot use DeepStream for this work.

Any idea why I only get zeros? I tried different kinds of pre-processing (with and without the offsets, BGR and RGB format, dividing the pixel values by 255…) and I always get the same type of output (only the number of detections varies).

Hi @johan_b,

We request you to post your concern in the TLT forum to get better help. You may get more details there on the post-processing part.

Thank you.

Hi spolisetty,

I just did: Inference with TensorRT after training Yolo v4 with TLT 3.0

Thanks,

Johan


Hi again,

I finally found the issue: I was not transferring all the data back from the GPU.

# Transfer predictions back from the GPU.
cuda.memcpy_dtoh_async(outputs[0].host, outputs[0].device, stream)
cuda.memcpy_dtoh_async(outputs[1].host, outputs[1].device, stream)
cuda.memcpy_dtoh_async(outputs[2].host, outputs[2].device, stream)
cuda.memcpy_dtoh_async(outputs[3].host, outputs[3].device, stream)
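
The same fix can be written as a loop over every output binding, so nothing is missed if the model's output count changes. A minimal stand-alone sketch (HostDeviceMem mirrors the helper class in the script above; `memcpy` stands in for `cuda.memcpy_dtoh_async` so the copy pattern can be shown without a GPU):

```python
# Stand-in for the helper class from the script above.
class HostDeviceMem:
    def __init__(self, host, device):
        self.host = host
        self.device = device

def copy_outputs_back(outputs, memcpy, stream):
    # Copy every output binding back, not just outputs[0].
    for out in outputs:
        memcpy(out.host, out.device, stream)

# Demonstrate that all four buffers get copied.
copied = []
outputs = [HostDeviceMem(f"host{i}", f"dev{i}") for i in range(4)]
copy_outputs_back(outputs, lambda h, d, s: copied.append(h), stream=None)
print(copied)  # ['host0', 'host1', 'host2', 'host3']
```

In the real script this is simply `for out in outputs: cuda.memcpy_dtoh_async(out.host, out.device, stream)` before `stream.synchronize()`.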

Thanks,

Johan
