0.3 FPS when running the yolov3_onnx TensorRT sample provided by NVIDIA on Jetson Nano

Q: I am trying to use the TensorRT sample yolov3_onnx to detect objects, as described at https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#yolov3_onnx. After I ran the two scripts below, detection started, but it was very slow: only about 0.3 FPS, far slower than I expected. I don’t know what’s wrong; please give me some advice.

  1. $ sudo python yolov3_to_onnx.py
  2. $ sudo python onnx_to_tensorrt.py
30. Object Detection With The ONNX TensorRT Backend In Python
What Does This Sample Do?

This sample, yolov3_onnx, implements a full ONNX-based pipeline for performing inference with the YOLOv3 network, with an input size of 608x608 pixels, including pre and post-processing. This sample is based on the YOLOv3-608 paper.
Note: This sample is not supported on Ubuntu 14.04 and older. Additionally, the yolov3_to_onnx.py script does not support Python 3.
Where Is This Sample Located?
This sample is installed in the /usr/src/tensorrt/samples/python/yolov3_onnx directory.
Getting Started:

Refer to the /usr/src/tensorrt/samples/python/yolov3_onnx/README.md file for detailed information about how this sample works, sample code, and step-by-step instructions on how to run and verify its output.

A summary of the README.md file is included in this section for your reference, however, you should always refer to the README.md within the package for the most recent documentation updates.

Here is my onnx_to_tensorrt.py code:

#!/usr/bin/env python2

from __future__ import print_function
import cv2 as cv
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
from PIL import ImageDraw
import time
from yolov3_to_onnx import download_file
from data_processing import PreprocessYOLO, PostprocessYOLO, ALL_CATEGORIES

import sys, os
sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common

TRT_LOGGER = trt.Logger()
cap = cv.VideoCapture(0)

def load_label_categories(label_file_path):
    categories = [line.rstrip('\n') for line in open(label_file_path)]
    return categories

LABEL_FILE_PATH = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'coco_labels.txt')
ALL_CATEGORIES = load_label_categories(LABEL_FILE_PATH)

def get_engine1(vv):
    with trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(vv)

def get_engine(onnx_file_path, engine_file_path=""):
    def build_engine():
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        with trt.Builder(TRT_LOGGER) as builder,\
              builder.create_network() as network, \
              trt.OnnxParser(network, TRT_LOGGER) as parser:

            builder.max_workspace_size = 1 << 30 # 1GB
            builder.max_batch_size = 1

            if not os.path.exists(onnx_file_path):
                print('ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
                sys.exit(1)

            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                # Populate the network definition from the serialized ONNX model
                parser.parse(model.read())
            print('Completed parsing of ONNX file')

            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")

            # Serialize the engine to disk so later runs can skip the build step
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            return engine

    if os.path.exists(engine_file_path):
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, \
             trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine()

def main():
    """Create a TensorRT engine for ONNX-based YOLOv3-608 and run inference."""

    # Try to load a previously generated YOLOv3-608 network graph in ONNX format:
    onnx_file_path = './yolov3.onnx'
    engine_file_path = "./yolov3.trt"
    # Deserialize the previously built TensorRT engine from disk
    with open(engine_file_path, "rb") as engine_file:
        engine = get_engine1(engine_file.read())
    context = engine.create_execution_context()
    # with get_engine(onnx_file_path, engine_file_path) as engine:
    #     print("finished")
    # Grab frames from the camera and run detection on each one:
    while True:
        ret,frame = cap.read()
        if ret:
            x, y = frame.shape[0:2]
            # Two-dimensional tuple with the target network's (spatial) input resolution in HW ordered
            input_resolution_yolov3_HW = (608, 608)
            # Create a pre-processor object by specifying the required input resolution for YOLOv3
            preprocessor = PreprocessYOLO(input_resolution_yolov3_HW)
            # Load an image from the specified input path, and return it together with  a pre-processed version
            image_raw, image = preprocessor.process(frame)
            # Store the shape of the original input image in WH format, we will need it for later
            shape_orig_WH = image_raw.size

            # Output shapes expected by the post-processor
            output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
            # Do inference with TensorRT
            trt_outputs = []
            inputs, outputs, bindings, stream = common.allocate_buffers(engine)
            # Set host input to the image. The common.do_inference function will copy the input to the GPU before executing.
            inputs[0].host = image
            trt_outputs = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            #b = time.clock()
            # Before doing post-processing, we need to reshape the outputs as the common.do_inference will give us flat arrays.
            trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]

            postprocessor_args = {"yolo_masks": [(6, 7, 8), (3, 4, 5), (0, 1, 2)],                    # A list of 3 three-dimensional tuples for the YOLO masks
                            "yolo_anchors": [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),  # A list of 9 two-dimensional tuples for the YOLO anchors
                                            (59, 119), (116, 90), (156, 198), (373, 326)],
                            "obj_threshold": 0.6,                                               # Threshold for object coverage, float value between 0 and 1
                            "nms_threshold": 0.5,                                               # Threshold for non-max suppression algorithm, float value between 0 and 1
                            "yolo_input_resolution": input_resolution_yolov3_HW}

            postprocessor = PostprocessYOLO(**postprocessor_args)

            # Run the post-processing algorithms on the TensorRT outputs and get the bounding box details of detected objects
            boxes, classes, scores = postprocessor.process(trt_outputs, (shape_orig_WH))
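            if boxes is not None:
                for box, score, category in zip(boxes, scores, classes):
                    x_coord, y_coord, width, height = box
                    left = max(0, np.floor(x_coord + 0.5).astype(int))
                    top = max(0, np.floor(y_coord + 0.5).astype(int))
                    right = min(image_raw.width, np.floor(x_coord + width + 0.5).astype(int))
                    bottom = min(image_raw.height, np.floor(y_coord + height + 0.5).astype(int))
                    # Draw the bounding box and its label on the frame
                    cv.rectangle(frame, (int(left), int(top)), (int(right), int(bottom)), (0, 0, 255), 2)
                    cv.putText(frame,"%s:%.2f"%(ALL_CATEGORIES[category],score),(left, top - 12),cv.FONT_HERSHEY_SIMPLEX,0.7,(0,0,255),2,0)

            # Show the annotated frame; cv.waitKey() only receives key presses while a window is open
            cv.imshow("yolov3_onnx", frame)
            c = cv.waitKey(20)
            if c==27:  # Esc key: leave the capture loop
                break

    cap.release()
    cv.destroyAllWindows()

if __name__ == '__main__':
    main()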


Would you mind trying the YOLO sample in the DeepStream SDK?

The one in the DeepStream SDK is optimized for the camera pipeline and buffer handling.

Thank you, but I want to find out why the provided example takes so long to detect objects, as much as 7 seconds per picture. What results do you get with this example?

Actually, I am familiar with OpenCV. I will try DeepStream if the yolov3_onnx sample itself turns out to be slow.


The bottleneck most likely comes from OpenCV.

Based on the source, the image is read into a CPU buffer and then copied to the GPU for inference.
This is not an optimal solution for a multimedia pipeline.
The sample is meant to demonstrate TensorRT and does not focus on the whole pipeline.
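
One way to check where the time actually goes is to time each stage of the loop separately. The sketch below is only an illustration: `StageTimer` and the stage names are hypothetical helpers, not part of the TensorRT sample or OpenCV. The commented usage assumes the objects from the script posted above (`cap`, `preprocessor`, `postprocessor`, `common`).

```python
import time
from collections import defaultdict


class StageTimer(object):
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # total seconds spent in each stage
        self.counts = defaultdict(int)    # number of calls recorded per stage

    def measure(self, name, func, *args, **kwargs):
        """Call func(*args, **kwargs), record its duration under `name`, return its result."""
        start = time.time()
        result = func(*args, **kwargs)
        self.totals[name] += time.time() - start
        self.counts[name] += 1
        return result

    def report(self):
        """Return the average seconds per call for every recorded stage."""
        return {name: self.totals[name] / self.counts[name] for name in self.totals}


# Usage inside the capture loop of the posted script (names taken from that script):
#   timer = StageTimer()
#   ret, frame = timer.measure("capture", cap.read)
#   image_raw, image = timer.measure("preprocess", preprocessor.process, frame)
#   trt_outputs = timer.measure("inference", common.do_inference, context,
#                               bindings=bindings, inputs=inputs,
#                               outputs=outputs, stream=stream)
#   ...
#   print(timer.report())
```

If one stage dominates the per-frame average, that is the place to optimize first.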


I modified the TensorRT ‘yolov3_onnx’ sample and got 1.53 FPS (yolov3-608) on Jetson Nano. (The FPS measurement included image acquisition and all of the preprocessing/postprocessing.)


One particular problem of the original ‘yolov3_onnx’ sample was its inefficient implementation of the postprocessing code. More specifically, I rewrote ‘sigmoid_v()’ and ‘exponential_v()’ with numpy calls and was able to speed up inference pretty noticeably.

Source code for reference: https://github.com/jkjung-avt/tensorrt_demos/blob/master/utils/yolov3.py#L212
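
To illustrate the kind of change described above, here is a minimal sketch, not the code from the linked repository. It assumes the slow versions apply a scalar Python function to every element, for example through `np.vectorize`, which the NumPy documentation describes as essentially a for loop provided for convenience rather than performance. `sigmoid_scalar` and `sigmoid_loop` are hypothetical names used only for the comparison.

```python
import math

import numpy as np


def sigmoid_scalar(value):
    """Sigmoid of a single float, computed with the math module."""
    return 1.0 / (1.0 + math.exp(-value))


# Assumed slow variant: np.vectorize calls the Python function once per element.
sigmoid_loop = np.vectorize(sigmoid_scalar)


def sigmoid_v(array):
    """Sigmoid of a whole array, computed with native NumPy operations."""
    return np.reciprocal(np.exp(-array) + 1.0)


def exponential_v(array):
    """Element-wise exponential of a whole array, computed with NumPy."""
    return np.exp(array)


if __name__ == "__main__":
    # One YOLOv3-608 output has shape (1, 255, 76, 76); a smaller array is used here.
    data = np.random.uniform(-5.0, 5.0, size=(1, 255, 19, 19))
    # Both variants should agree numerically; only their speed differs.
    print(np.allclose(sigmoid_loop(data), sigmoid_v(data)))
```

Both variants return the same values, so the swap should not change the detections; the gain comes from avoiding a Python-level call for each of the output elements.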


Hi jkjung13,

We appreciate your sharing and contribution!

Hi jkjung13,
Thanks a lot, I’ll try your method!