Engine Plan Inference on JetsonTX2

Hi guys,

I trained a model using the retinanet-examples repo on a host computer, and it worked well. I then optimized it with TensorRT, still on the host, and now I have a “.plan” file that I’d like to use for inference on the Jetson TX2.

The only information I found is this paragraph of the TensorRT Developer Guide:

3.5. Performing Inference In Python

The following steps illustrate how to perform inference in Python, now that you have an engine.

  1. Allocate some host and device buffers for inputs and outputs:

# Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
# Allocate device memory for inputs and outputs.
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# Create a stream in which to copy inputs/outputs and run inference.
stream = cuda.Stream()

  2. Create some space to store intermediate activation values. Since the engine holds the network definition and trained parameters, additional space is necessary. These are held in an execution context:

with engine.create_execution_context() as context:
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference.
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    # Transfer predictions back from the GPU.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Synchronize the stream
    stream.synchronize()
    # Return the host output.
    return h_output

An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context will be created on the same GPU as the engine.

It gives some hints, but I don’t understand how to do the initial import. Is there any other source that shows, step by step, how to import an engine, run it on images and show the detected boxes on the output?
Also, how can I set the confidence threshold that the engine should use when it detects an object?
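
On the threshold question, one option that does not depend on how the engine was exported is to filter the detections after inference. The sketch below is only an illustration: it assumes the outputs have already been copied back to the host and split into three parallel sequences (scores, boxes, classes), and `filter_detections` is a hypothetical helper, not part of TensorRT or retinanet-examples.

```python
def filter_detections(scores, boxes, classes, threshold=0.5):
    """Keep only the detections whose score is at least `threshold`.

    Assumed layout (not confirmed by the thread): three parallel sequences,
    one score, one [x1, y1, x2, y2] box and one class id per detection.
    """
    kept = []
    for score, box, cls in zip(scores, boxes, classes):
        if score >= threshold:
            kept.append((float(score), tuple(box), int(cls)))
    return kept


if __name__ == "__main__":
    detections = filter_detections(
        scores=[0.9, 0.2, 0.5],
        boxes=[[0, 0, 10, 10], [1, 1, 2, 2], [3, 3, 4, 4]],
        classes=[1, 2, 3],
        threshold=0.5,
    )
    for score, box, cls in detections:
        print(cls, score, box)
```

The kept boxes can then be drawn on the image with whatever drawing library is already in use.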

Just saying it’s really hard to find reliable information in the entire world of documentation…

Thank you in advance

Hi,

Please note that a TensorRT engine is not portable across platforms.
So you will need to recreate it directly on the Jetson TX2.

To run RetinaNet on Jetson, the simplest way is to follow our Deepstream sample here:

Thanks.

Hi AastaLLL,
thanks for the clarification.

I was wondering if there was another way to use the model on Jetson.
I have a .pth file too. I can convert it to .onnx and then run the TensorRT optimization on the Jetson, right? (I can’t remember which Nvidia repo contains the TensorRT script that does the optimization from an ONNX file. Is it something related to TensorFlow?)

The idea would be to use the model without the retinanet command, just like any other model.
As the TensorRT Python API isn’t supported by JetPack 3.3, I’ll do the TensorRT test after updating the OS (or try the C++ API, but I’m not proficient in C++).

Because of the lockdown I can’t access my Jetson physically, only via ssh, but I would like to run some tests (speed and accuracy). My version is JetPack 3.3 (I’ll update it right after the end of lockdown).

Thanks

Hi,

Yes. You can use the ONNX model as input to generate a TensorRT engine.
There are two samples that can do this conversion:

1. GitHub

2. TensorRT sample

Just convert the model with this command:

$ /usr/src/tensorrt/bin/trtexec --onnx=[path/to/onnx/file] --saveEngine=[output/engine/name]
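
If the conversion needs to be scripted, the same command can be launched from Python. This is a minimal sketch: `trtexec_command` and `convert` are hypothetical helper names, and only the two flags shown in the command above are used. Note that older trtexec builds, such as the TensorRT 4 one used later in this thread, take `--engine=` instead of `--saveEngine=`.

```python
import subprocess

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # path taken from the command above


def trtexec_command(onnx_path, engine_path, trtexec=TRTEXEC):
    """Build the argument list for converting an ONNX file into an engine."""
    return [trtexec, "--onnx=" + onnx_path, "--saveEngine=" + engine_path]


def convert(onnx_path, engine_path):
    """Run trtexec; raises CalledProcessError if it exits with an error."""
    subprocess.run(trtexec_command(onnx_path, engine_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert(...) on the Jetson to run it.
    print(" ".join(trtexec_command("model.onnx", "model.trt")))
```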

Thanks.

Ok great !

And then how should I proceed to do some inference using the engine plan? Is there any way using Python?

What if I want to compare performance between the model before and after optimization? Can I do inference using the ONNX model directly?

Thank you

Hi,

The trtexec tool will generate some benchmark data for you.
We also have some Python samples in /usr/src/tensorrt/samples/python/.

You can run inference on the ONNX model directly with ONNX Runtime:
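
For a rough before/after comparison, the same timing loop can be wrapped around both the ONNX Runtime call and the TensorRT call. The following is a sketch under assumptions: `mean_latency_ms` and `make_onnx_runner` are hypothetical helpers, the model is assumed to have a single input, and the `onnxruntime` package must be installed for the second function to work.

```python
import time


def mean_latency_ms(run_once, warmup=3, runs=20):
    """Average wall-clock time of run_once() in milliseconds."""
    for _ in range(warmup):          # warm-up calls are not timed
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / runs


def make_onnx_runner(onnx_path, input_array):
    """Return a zero-argument callable that runs the ONNX model once."""
    import onnxruntime as ort        # imported lazily; optional dependency
    # Some onnxruntime builds require an explicit providers=[...] list here.
    session = ort.InferenceSession(onnx_path)
    input_name = session.get_inputs()[0].name   # assumes a single input
    return lambda: session.run(None, {input_name: input_array})


if __name__ == "__main__":
    # Stand-in workload so the timing helper can be tried without a model.
    print(mean_latency_ms(lambda: sum(range(10000))))
```

Passing a callable that wraps the TensorRT inference call to `mean_latency_ms` gives a number measured the same way for the optimized engine.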

Thanks.

Hi AastaLLL,

I tried to convert an ONNX model to a .trt file using the trtexec command you provided, but here is the error I get:

nvidia@tegra-ubuntu:~$ /usr/src/tensorrt/bin/trtexec --onnx=RN50/rn50crop.onnx --engine=engine_rn50crop.trt
onnx: RN50/rn50crop.onnx
engine: engine_rn50crop.trt
input: "533"
input: "535"
output: "536"
op_type: "Resize"
attribute {
  name: "mode"
  s: "nearest"
  type: STRING
}
terminate called after throwing an instance of 'std::out_of_range'
  what():  No converter registered for op type: Resize

It seems to be a problem with the input size. How should I deal with it, given that trtexec has no size parameter?

Thanks

PS: I have an old version of TRT (TensorRT 4 with CUDA 9.0) with JetPack 3.3, so it seems I can’t use the onnx-tensorrt repo, as it requires at least TensorRT 5…

Hey,
I updated my Jetson TX2 to JetPack 4.3, so now it’s an all-new environment with TensorRT 7.1.

Concerning the onnx-tensorrt repository, I don’t understand how to install it. It says to refer to the TensorRT installation, but since I have JetPack, TensorRT is already installed.

By the way, how can I link the TensorRT library to a virtual environment so that I can call it using “import tensorrt”?

Thanks

Hi,

I assume you are using JetPack 4.4 rather than 4.3, since TensorRT 7.1 ships with JetPack 4.4.

1. Please follow these steps to build onnx-tensorrt from source:

ONNX

export PYVER=3.6
git clone https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt/third_party/onnx/
export CPLUS_INCLUDE_PATH=/usr/include/python3.6:/usr/local/cuda/targets/aarch64-linux/include
mkdir -p build && cd build
cmake -DCMAKE_CXX_FLAGS=-I/usr/include/python${PYVER} -DBUILD_ONNX_PYTHON=ON -Dpybind11_DIR=/home/nvidia/pybind11/install/share/cmake/pybind11/ -DBUILD_SHARED_LIBS=ON ..
sudo make -j$(nproc) install && \
sudo ldconfig && \
cd .. && \
sudo mkdir -p /usr/include/x86_64-linux-gnu/onnx && \
sudo cp build/onnx/onnx*pb.* /usr/include/x86_64-linux-gnu/onnx && \
sudo cp build/libonnx.so /usr/local/lib && \
sudo rm -f /usr/lib/x86_64-linux-gnu/libonnx_proto.a && \
sudo ldconfig

ONNX-TensorRT

cd ../../ && \
mkdir -p build && \
cd build && \
cmake  -DCMAKE_CXX_FLAGS=-I/usr/local/cuda/targets/aarch64-linux/include -DONNX_NAMESPACE=onnx2trt_onnx .. && \
sudo make -j$(nproc) install && \
sudo ldconfig
cd ../../../../

2. Yes. Please try this:

import ctypes
...
ctypes.CDLL("lib/libxxxxx.so")
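
For the “import tensorrt” part specifically, another option sometimes used is to make the system-wide Python packages visible inside the virtual environment. The sketch below assumes the JetPack packages live in /usr/lib/python3.6/dist-packages, which should be verified on the device; `add_system_packages` is a hypothetical helper.

```python
import os
import sys

# Assumed location of the JetPack-installed Python packages; verify on your system.
SYSTEM_DIST_PACKAGES = "/usr/lib/python3.6/dist-packages"


def add_system_packages(path=SYSTEM_DIST_PACKAGES):
    """Append `path` to sys.path if it exists and is not already listed."""
    if os.path.isdir(path) and path not in sys.path:
        sys.path.append(path)
        return True
    return False


if __name__ == "__main__":
    if add_system_packages():
        print("Added", SYSTEM_DIST_PACKAGES)
    # After this, `import tensorrt` can be attempted from inside the venv.
```

Creating the environment with the `--system-site-packages` option of `python3 -m venv` achieves a similar effect without touching `sys.path` at run time.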

Thanks.

Hey,
thanks for the install advice, it works great.

Now I want to use the engine file to do some inference and here is what I use:

import tensorrt as trt
import argparse
from onnx import ModelProto
import pycuda.driver as cuda
import numpy as np
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
def build_engine(onnx_path, shape = [64,1280,1280,3]):

   """
   This is the function to create the TensorRT engine
   Args:
      onnx_path : Path to onnx_file.
      shape : Shape of the input of the ONNX file.
  """
   with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
       builder.max_workspace_size = (1 << 16)
       with open(onnx_path, 'rb') as model:
           parser.parse(model.read())
       network.get_input(0).shape = shape
       engine = builder.build_cuda_engine(network)
       return engine
def save_engine(engine, file_name):
   buf = engine.serialize()
   with open(file_name, 'wb') as f:
       f.write(buf)
       print('Engine saved')
def load_engine(trt_runtime, engine_path):
   with open(engine_path, 'rb') as f:
       engine_data = f.read()
   engine = trt_runtime.deserialize_cuda_engine(engine_data)
   return engine

 #########################################################################


def allocate_buffers(engine, batch_size, data_type):

   """
   This is the function to allocate buffers for input and output in the device
   Args:
      engine : The deserialized TensorRT engine.
      batch_size : The batch size for execution time.
      data_type: The type of the data for input and output, for example trt.float32.

   Output:
      h_input_1: Input in the host.
      d_input_1: Input in the device.
      h_output: Output in the host.
      d_output: Output in the device.
      stream: CUDA stream.

   """

   # Determine dimensions and create page-locked memory buffers (which won't be swapped to disk) to hold host inputs/outputs.
   h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
   h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
   # Allocate device memory for inputs and outputs.
   d_input_1 = cuda.mem_alloc(h_input_1.nbytes)

   d_output = cuda.mem_alloc(h_output.nbytes)
   # Create a stream in which to copy inputs/outputs and run inference.
   stream = cuda.Stream()
   return h_input_1, d_input_1, h_output, d_output, stream

def load_images_to_buffer(pics, pagelocked_buffer):
   preprocessed = np.asarray(pics).ravel()
   np.copyto(pagelocked_buffer, preprocessed)

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
   """
   This is the function to run the inference
   Args:
      engine : The deserialized TensorRT engine
      pics_1 : Input images to the model.
      h_input_1: Input in the host
      d_input_1: Input in the device
      h_output: Output in the host
      d_output: Output in the device
      stream: CUDA stream
      batch_size : Batch size for execution time
      height: Height of the output image
      width: Width of the output image

   Output:
      The list of output images

   """

   load_images_to_buffer(pics_1, h_input_1)

   with engine.create_execution_context() as context:
       # Transfer input data to the GPU.
       cuda.memcpy_htod_async(d_input_1, h_input_1, stream)

       # Run inference.

       context.profiler = trt.Profiler()
       context.execute(batch_size=batch_size, bindings=[int(d_input_1), int(d_output)])

       # Transfer predictions back from the GPU.
       cuda.memcpy_dtoh_async(h_output, d_output, stream)
       # Synchronize the stream
       stream.synchronize()
       # Return the host output.
       out = h_output.reshape((batch_size,-1, height, width))
       return out

And here is the script that runs inference with the .trt file:

import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
import pycuda.autoinit
import argparse
from onnx import ModelProto
from function import *
import sys

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(TRT_LOGGER,'')
trt_runtime = trt.Runtime(TRT_LOGGER)
batch_size = 64
data_type = trt.float16
height = 1280
width = 1280

engine_path = sys.argv[1]

engine = load_engine(trt_runtime,engine_path)

h_input_1, d_input_1, h_output, d_output, stream = allocate_buffers(engine,batch_size,data_type)

out = do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width)
print(out)

But when I do that I encounter this error:

(cv) nvidia@nvidia-desktop:~/test_trt$ python trt_infer.py rn50engine.trt 439.jpg
[TensorRT] ERROR: ../rtSafe/cuda/cudaActivationRunner.cpp (103) - Cudnn Error in execute: 3 (CUDNN_STATUS_BAD_PARAM)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
[TensorRT] ERROR: engine.cpp (179) - Cuda Error in ~ExecutionContext: 719 (unspecified launch failure)
[TensorRT] ERROR: INTERNAL_ERROR: std::exception
[TensorRT] ERROR: Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::155, condition: cudnnDestroy(context.cudnn) failure.
[TensorRT] ERROR: Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
[TensorRT] ERROR: Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 719 (unspecified launch failure)
terminate called after throwing an instance of 'nvinfer1::CudaError'
  what():  std::exception
Aborted (core dumped)

My problem is that it’s not really clear what the workflow is to execute an engine on an image or video.
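
One step of that workflow is turning an image into a flat buffer that matches the engine's input binding before copying it into the page-locked host buffer. A minimal sketch, assuming the engine expects a single NCHW image scaled to [0, 1] (the real layout, size and normalization depend on how the model was exported); `image_to_input` is a hypothetical helper and the image is assumed to be already resized.

```python
import numpy as np


def image_to_input(image_hwc, dtype=np.float32):
    """Convert one HxWx3 uint8 image into a flat NCHW buffer (batch size 1)."""
    chw = np.transpose(image_hwc, (2, 0, 1))        # HWC -> CHW
    scaled = chw.astype(np.float32) / 255.0         # assumed scaling to [0, 1]
    batched = scaled[np.newaxis, ...]               # add the batch dimension
    return np.ascontiguousarray(batched, dtype=dtype).ravel()


if __name__ == "__main__":
    # Dummy image standing in for a decoded and resized frame.
    image = np.zeros((1280, 1280, 3), dtype=np.uint8)
    flat = image_to_input(image)
    print(flat.shape, flat.dtype)
```

The resulting array must have the same number of elements and the same dtype as the host input buffer, otherwise the copy into it fails or the engine receives misaligned data.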

Thanks

Hi,

Cudnn Error in execute: 3 (CUDNN_STATUS_BAD_PARAM)

This error suggests that the input fed to cuDNN/TensorRT may be incorrect.
We have a Python example that demonstrates how to deploy a TensorRT model.


Please check the sample to see if anything is missing in your implementation.
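
A quick way to see what the engine actually expects is to list its bindings and compare them with the buffers being allocated. This is a sketch, not the official sample: `describe_bindings` and `element_count` are hypothetical helpers, and they rely on the binding methods of the TensorRT 7 Python API (`num_bindings`, `get_binding_name`, `binding_is_input`, `get_binding_shape`, `get_binding_dtype`), which later TensorRT releases deprecate or remove.

```python
def describe_bindings(engine):
    """List name, direction, shape and dtype for every binding of an engine."""
    rows = []
    for index in range(engine.num_bindings):
        rows.append({
            "index": index,
            "name": engine.get_binding_name(index),
            "is_input": engine.binding_is_input(index),
            "shape": tuple(engine.get_binding_shape(index)),
            "dtype": engine.get_binding_dtype(index),
        })
    return rows


def element_count(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def print_bindings(engine):
    """Print one line per binding so it can be compared with the host buffers."""
    for row in describe_bindings(engine):
        kind = "input" if row["is_input"] else "output"
        print(row["index"], kind, row["name"], row["shape"], row["dtype"],
              element_count(row["shape"]))
```

If the reported shapes already include the batch dimension, or the reported dtype differs from the one used to allocate the buffers, the buffer sizes computed in the script will not match what the engine reads and writes.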

Thanks