How can I get 1000 FPS when running inference with TensorRT Tiny-YOLOv3 (Jetson AGX Xavier)?

chakibdace · May 18, 2021, 3:06pm

Hello,

I’m trying to reproduce the NVIDIA benchmark for TensorRT Tiny-YOLOv3 (1000 FPS) on a Jetson AGX Xavier with the parameters below, but I only get 700 FPS:

Power Mode : MAXN
Input resolution : 416x416
Precision Mode : INT8 (Calibration with 1000 images and IInt8EntropyCalibrator2 interface)
batch = 8
JetPack Version : 4.5.1
TensorRT version : 7.1.3

First, I generated the serialized ONNX model by following jkjung-avt’s steps and using yolov3_to_onnx.py, which is located in /usr/src/tensorrt/samples/python/yolov3_onnx. Then I edited the script onnx_to_tensorrt.py to build the TensorRT engine and produce the Tiny-YOLOv3 .trt file used to run the inference.

Here is the edited onnx_to_tensorrt.py I used to build tiny-yolov3.trt and run the inference:

#!/usr/bin/env python3
#
# Copyright 1993-2020 NVIDIA Corporation.  All rights reserved.
#
# NOTICE TO LICENSEE:
#
# This source code and/or documentation ("Licensed Deliverables") are
# subject to NVIDIA intellectual property rights under U.S. and
# international Copyright laws.
#
# These Licensed Deliverables contained herein is PROPRIETARY and
# CONFIDENTIAL to NVIDIA and is being provided under the terms and
# conditions of a form of NVIDIA software license agreement by and
# between NVIDIA and Licensee ("License Agreement") or electronically
# accepted by Licensee.  Notwithstanding any terms or conditions to
# the contrary in the License Agreement, reproduction or disclosure
# of the Licensed Deliverables to any third party without the express
# written consent of NVIDIA is prohibited.
#
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, NVIDIA MAKES NO REPRESENTATION ABOUT THE
# SUITABILITY OF THESE LICENSED DELIVERABLES FOR ANY PURPOSE.  IT IS
# PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED WARRANTY OF ANY KIND.
# NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE LICENSED
# DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY,
# NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
# NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE
# LICENSE AGREEMENT, IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY
# SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, OR ANY
# DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
# WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
# ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
# OF THESE LICENSED DELIVERABLES.
#
# U.S. Government End Users.  These Licensed Deliverables are a
# "commercial item" as that term is defined at 48 C.F.R. 2.101 (OCT
# 1995), consisting of "commercial computer software" and "commercial
# computer software documentation" as such terms are used in 48
# C.F.R. 12.212 (SEPT 1995) and is provided to the U.S. Government
# only as a commercial end item.  Consistent with 48 C.F.R.12.212 and
# 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all
# U.S. Government End Users acquire the Licensed Deliverables with
# only those rights set forth herein.
#
# Any use of the Licensed Deliverables in individual and commercial
# software must include, in the user documentation and internal
# comments to the code, the above Disclaimer and U.S. Government End
# Users Notice.
#

from __future__ import print_function

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
from PIL import ImageDraw


from yolov3_to_onnx import download_file
from data_processing import PreprocessYOLO, PostprocessYOLO, ALL_CATEGORIES

import sys, os

sys.path.insert(1, os.path.join(sys.path[0], ".."))
import common

#print(sys.modules['common'])


import time
import argparse
import cv2
TRT_LOGGER = trt.Logger()




desc = ('This is an edited NVIDIA sample showing how to run YOLOv3 and Tiny-YOLOv3 with TensorRT. '
        'Before executing this code, run yolov3_to_onnx.py to convert the DarkNet model to ONNX. '
        'Once the serialized ONNX model has been generated, run this script and specify the model, '
        'resolution, precision mode and batch size. '
        'For example, to run a YOLOv3 model on the image dog.jpg at 416x416 resolution in FP16 '
        'precision mode with batch=1, use: '
        'sudo python3 onnx_to_tensorrt.py -i dog -m yolov3 -r 416 -p FP16 -b 1')
parser = argparse.ArgumentParser(description=desc)
parser.add_argument('-i', '-input', '-image', help="Set the name of the input image", type=str)
parser.add_argument('-m', '-model', help="Set the name of the model you want to use \n <<yolov3>> to use YOLOv3 \n <<tiny>> to use Tiny-YOLOv3", type=str)
parser.add_argument('-r', '-resolution', help="Set the resolution of the input [608, 416 or 288]", type=str)
parser.add_argument('-p', '-precision', help="Set the precision mode [FP32, FP16 or INT8]", type=str)
parser.add_argument('-b', '-batch', help="Set The size of the batch", type=int)
args = parser.parse_args()

batch_size = args.b



class YOLOEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """YOLOEntropyCalibrator

    This class implements TensorRT's IInt8EntropyCalibrator2 interface.
    It reads all images from the specified directory and generates INT8
    calibration data for YOLO models accordingly.
    """

    def __init__(self, img_dir, net_hw, cache_file, batch_size=1):
        if not os.path.isdir(img_dir):
            raise FileNotFoundError('%s does not exist' % img_dir)
        if len(net_hw) != 2 or net_hw[0] % 32 or net_hw[1] % 32:
            raise ValueError('bad net shape: %s' % str(net_hw))

        super().__init__()  # trt.IInt8EntropyCalibrator2.__init__(self)

        self.img_dir = img_dir
        self.net_hw = net_hw
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.blob_size = 3 * net_hw[0] * net_hw[1] * np.dtype('float32').itemsize * batch_size

        self.jpgs = [f for f in os.listdir(img_dir) if f.endswith('.jpg')]
        # The number "500" is NVIDIA's suggestion.  See here:
        # https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#optimizing_int8_c
        if len(self.jpgs) < 500:
            print('WARNING: found less than 500 images in %s!' % img_dir)
        self.current_index = 0

        # Allocate enough memory for a whole batch.
        self.device_input = cuda.mem_alloc(self.blob_size)

    def __del__(self):
        del self.device_input  # free CUDA memory

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current_index + self.batch_size > len(self.jpgs):
            return None
        current_batch = int(self.current_index / self.batch_size)

        batch = []
        for i in range(self.batch_size):
            img_path = os.path.join(
                self.img_dir, self.jpgs[self.current_index + i])
            img = cv2.imread(img_path)
            assert img is not None, 'failed to read %s' % img_path
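            # NOTE: _preprocess_yolo is not defined or imported in this script and must be
            # provided separately; the size check below expects a float32 array of shape (3, H, W).
            batch.append(_preprocess_yolo(img, self.net_hw))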
        batch = np.stack(batch)
        assert batch.nbytes == self.blob_size

        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again.
        # Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

def draw_bboxes(image_raw, bboxes, confidences, categories, all_categories, bbox_color='blue'):
    """Draw the bounding boxes on the original input image and return it.

    Keyword arguments:
    image_raw -- a raw PIL Image
    bboxes -- NumPy array containing the bounding box coordinates of N objects, with shape (N,4).
    categories -- NumPy array containing the corresponding category for each object,
    with shape (N,)
    confidences -- NumPy array containing the corresponding confidence for each object,
    with shape (N,)
    all_categories -- a list of all categories in the correct ordered (required for looking up
    the category name)
    bbox_color -- an optional string specifying the color of the bounding boxes (default: 'blue')
    """
    draw = ImageDraw.Draw(image_raw)
    print(bboxes, confidences, categories)
    for box, score, category in zip(bboxes, confidences, categories):
        x_coord, y_coord, width, height = box
        left = max(0, np.floor(x_coord + 0.5).astype(int))
        top = max(0, np.floor(y_coord + 0.5).astype(int))
        right = min(image_raw.width, np.floor(x_coord + width + 0.5).astype(int))
        bottom = min(image_raw.height, np.floor(y_coord + height + 0.5).astype(int))

        draw.rectangle(((left, top), (right, bottom)), outline=bbox_color)
        draw.text((left, top - 12), '{0} {1:.2f}'.format(all_categories[category], score), fill=bbox_color)

    return image_raw


def resolution_percisionMode_choice(model, resolution, precision, batch):
    if resolution == "608":
        input_resolution = (608, 608)
        input_shape = [batch, 3, 608, 608]
        output_shape = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
        if model == "yolov3":
            path_onnx = "yolov3-608.onnx"
            if batch == 1:
                if precision == "FP32":
                    path_trt = "yolov3-608_FP32.trt"
                elif precision == "FP16":
                    path_trt = "yolov3-608_FP16.trt"
                else:
                    path_trt = "yolov3-608_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-608_FP32_b"+str(batch)+".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-608_FP16_b"+str(batch)+".trt"
                else:
                    path_trt = "yolov3-608_INT8_b"+str(batch)+".trt"

        else:
            path_onnx = "yolov3-tiny-608.onnx"
            if batch == 1:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-608_FP32.trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-608_FP16.trt"
                else:
                    path_trt = "yolov3-tiny-608_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-608_FP32_b"+str(batch)+".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-608_FP16_b"+str(batch)+".trt"
                else:
                    path_trt = "yolov3-tiny-608_INT8_b"+str(batch)+".trt"



    elif resolution == "416":
        input_resolution = (416, 416)
        input_shape = [batch, 3, 416, 416]
        output_shape = [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]
        if model == "yolov3":
            path_onnx = "yolov3-416.onnx"
            if batch == 1:
                if precision == "FP32":
                    path_trt = "yolov3-416_FP32.trt"
                elif precision == "FP16":
                    path_trt = "yolov3-416_FP16.trt"
                else:
                    path_trt = "yolov3-416_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-416_FP32_b"+str(batch)+".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-416_FP16_b"+str(batch)+".trt"
                else:
                    path_trt = "yolov3-416_INT8_b"+str(batch)+".trt"

        else:
            path_onnx = "yolov3-tiny-416.onnx"
            if batch == 1:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-416_FP32.trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-416_FP16.trt"
                else:
                    path_trt = "yolov3-tiny-416_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-416_FP32_b" + str(batch) + ".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-416_FP16_b" + str(batch) + ".trt"
                else:
                    path_trt = "yolov3-tiny-416_INT8_b" + str(batch) + ".trt"




    elif resolution == "288":
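        input_resolution = (288, 288)
        input_shape = [batch, 3, 288, 288]
        # output_shape was missing for this resolution (288/32 = 9, 288/16 = 18, 288/8 = 36).
        output_shape = [(1, 255, 9, 9), (1, 255, 18, 18), (1, 255, 36, 36)]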
        if model == "yolov3":
            path_onnx = "yolov3-288.onnx"
            if batch == 1:
                if precision == "FP32":
                        path_trt = "yolov3-288_FP32.trt"
                elif precision == "FP16":
                        path_trt = "yolov3-288_FP16.trt"
                else:
                        path_trt = "yolov3-288_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-288_FP32_b" + str(batch) + ".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-288_FP16_b" + str(batch) + ".trt"
                else:
                    path_trt = "yolov3-288_INT8_b" + str(batch) + ".trt"

        else:
            path_onnx = "yolov3-tiny-288.onnx"
            if batch == 1:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-288_FP32.trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-288_FP16.trt"
                else:
                    path_trt = "yolov3-tiny-288_INT8.trt"
            else:
                if precision == "FP32":
                    path_trt = "yolov3-tiny-288_FP32_b" + str(batch) + ".trt"
                elif precision == "FP16":
                    path_trt = "yolov3-tiny-288_FP16_b" + str(batch) + ".trt"
                else:
                    path_trt = "yolov3-tiny-288_INT8_b" + str(batch) + ".trt"

    else:
        # Fail early: the return statement below would otherwise reference undefined variables.
        raise ValueError("The resolution can take only the following values: 608, 416 or 288")
    return input_resolution, input_shape, path_trt, path_onnx, output_shape


input_res, input_shape, path_trt, path_onnx, output_shape = resolution_percisionMode_choice(args.m, args.r, args.p, batch_size)


def get_engine(onnx_file_path, engine_file_path=""):
    """Attempts to load a serialized engine if available, otherwise builds a new TensorRT engine and saves it."""

    def build_engine():
        """Takes an ONNX file and creates a TensorRT engine to run inference with"""
        with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
                common.EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
            trt.init_libnvinfer_plugins(None, "")

            builder.max_workspace_size = 1 << 28  # 256MiB
            builder.max_batch_size = batch_size
            if args.p == "FP16":
                print("Using FP16 precision mode...")
                builder.fp16_mode = True
            if args.p == "INT8":
                print("Using INT8 precision mode...")
                builder.int8_mode = True
                builder.int8_calibrator = YOLOEntropyCalibrator('calib_images', (416, 416), 'calib_yolov3-tiny-int8-416.bin')
            # Parse model file
            if not os.path.exists(onnx_file_path):
                print(
                    'ONNX file {} not found, please run yolov3_to_onnx.py first to generate it.'.format(onnx_file_path))
                exit(0)
            print('Loading ONNX file from path {}...'.format(onnx_file_path))
            with open(onnx_file_path, 'rb') as model:
                print('Beginning ONNX file parsing')
                if not parser.parse(model.read()):
                    print('ERROR: Failed to parse the ONNX file.')
                    for error in range(parser.num_errors):
                        print(parser.get_error(error))
                    return None
            # The actual yolov3.onnx is generated with batch size 64. Reshape the input to the requested input_shape (batch, 3, H, W).
            network.get_input(0).shape = input_shape
            print('Completed parsing of ONNX file')
            print('Building an engine from file {}; this may take a while...'.format(onnx_file_path))
            engine = builder.build_cuda_engine(network)
            print("Completed creating Engine")
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            return engine

    if os.path.exists(engine_file_path):
        # If a serialized engine exists, use it instead of building an engine.
        print("Reading engine from file {}".format(engine_file_path))
        with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())
    else:
        return build_engine()



def main():
    """Load or build a TensorRT engine for the selected YOLOv3 / Tiny-YOLOv3 ONNX model and run inference."""

    # Paths of the ONNX model and of the serialized engine, selected from the command-line arguments:
    onnx_file_path = path_onnx
    engine_file_path = path_trt
    # Build the input image file name from the -i argument:
    input_img = args.i + ".jpg"
    input_image_path = input_img

    # Two-dimensional tuple with the target network's (spatial) input resolution in HW ordered
    input_resolution_yolov3_HW = input_res
    # Create a pre-processor object by specifying the required input resolution for YOLOv3
    preprocessor = PreprocessYOLO(input_resolution_yolov3_HW)
    # Load an image from the specified input path, and return it together with  a pre-processed version
    image_raw, image = preprocessor.process(input_image_path)
    # Store the shape of the original input image in WH format, we will need it for later
    shape_orig_WH = image_raw.size
    image = image.repeat(batch_size, axis=0)
    # Output shapes expected by the post-processor
    output_shapes = output_shape
    # Do inference with TensorRT
    trt_outputs = []
    with get_engine(onnx_file_path, engine_file_path) as engine, engine.create_execution_context() as context:
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        # Do inference
        print('Running inference on image {}...'.format(input_image_path))
        # Set host input to the image. The common.do_inference function will copy the input to the GPU before executing.
        inputs[0].host = image
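        # Warm-up run (not timed).
        trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
        # Timed runs. Each interval covers the whole common.do_inference_v2 call (input copy to the GPU,
        # execution, output copy back to the host and stream synchronization), not GPU compute alone.
        nbr_frame = 100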
        counter = 0
        sum_FPS = 0
        while (counter < nbr_frame):
            t0 = time.time()
            trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs,
                                                 stream=stream)
            t1 = time.time()
            counter += 1
            FPS = batch_size / (t1 - t0)
            sum_FPS = sum_FPS + FPS
            AVG_FPS = sum_FPS / counter
            print("Latency = {:.2f}ms | FPS = {:.2f} | AVG_FPS = {:.2f}".format(
                (t1 - t0) * 1000, FPS, AVG_FPS))

    # Before doing post-processing, we need to reshape the outputs as the common.do_inference will give us flat arrays.

    trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, output_shapes)]

    postprocessor_args = {"yolo_masks": [(6, 7, 8), (3, 4, 5), (0, 1, 2)],
                          # A list of 3 three-dimensional tuples for the YOLO masks
                          "yolo_anchors": [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                                           # A list of 9 two-dimensional tuples for the YOLO anchors
                                           (59, 119), (116, 90), (156, 198), (373, 326)],
                          "obj_threshold": 0.3,  # Threshold for object coverage, float value between 0 and 1
                          "nms_threshold": 0.5,
                          # Threshold for non-max suppression algorithm, float value between 0 and 1
                          "yolo_input_resolution": input_resolution_yolov3_HW}

    postprocessor = PostprocessYOLO(**postprocessor_args)
    t2 = time.time()
    # Run the post-processing algorithms on the TensorRT outputs and get the bounding box details of detected objects
    boxes, classes, scores = postprocessor.process(trt_outputs, (shape_orig_WH))
    t3 = time.time()
    # Draw the bounding boxes onto the original input image and save it as a PNG file
    obj_detected_img = draw_bboxes(image_raw, boxes, scores, classes, ALL_CATEGORIES)
    t4 = time.time()
    output_image_path = args.i+"_out_"+args.m+"_"+args.r+"_"+args.p+".png"
    obj_detected_img.save(output_image_path, 'PNG')
    print('Saved image with bounding boxes of detected objects to {}.'.format(output_image_path))
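    # t0 and t1 come from the last timed iteration; FPS accounts for the batch size as in the loop above.
    print("Latency = {:.2f}ms | FPS = {:.2f} | post processing = {:.2f}ms | drawing = {:.2f}ms".format(
        (t1 - t0) * 1000, batch_size / (t1 - t0), (t3 - t2) * 1000, (t4 - t3) * 1000))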


if __name__ == '__main__':
    main()

To execute the script with a batch size of 1, I use this command:

sudo python3 onnx_to_tensorrt.py -i dog -m tiny -r 416 -p INT8 -b 1

OUTPUT:

Reading engine from file yolov3-tiny-416_INT8.trt
Running inference on image dog.jpg...
Latency = 1.86ms | FPS = 538.91 | AVG_FPS = 538.91
Latency = 1.82ms | FPS = 548.63 | AVG_FPS = 543.77
Latency = 1.80ms | FPS = 555.76 | AVG_FPS = 547.77
Latency = 1.85ms | FPS = 539.11 | AVG_FPS = 545.60
Latency = 1.99ms | FPS = 502.73 | AVG_FPS = 537.03
Latency = 2.24ms | FPS = 447.30 | AVG_FPS = 522.07
Latency = 2.09ms | FPS = 478.42 | AVG_FPS = 515.84
Latency = 1.90ms | FPS = 527.06 | AVG_FPS = 517.24
Latency = 1.81ms | FPS = 551.59 | AVG_FPS = 521.06
Latency = 1.84ms | FPS = 544.64 | AVG_FPS = 523.42
Latency = 1.81ms | FPS = 553.05 | AVG_FPS = 526.11
Latency = 2.02ms | FPS = 494.67 | AVG_FPS = 523.49
Latency = 1.78ms | FPS = 560.89 | AVG_FPS = 526.37
Latency = 2.01ms | FPS = 496.66 | AVG_FPS = 524.24
Latency = 2.62ms | FPS = 382.10 | AVG_FPS = 514.77
Latency = 2.38ms | FPS = 420.06 | AVG_FPS = 508.85
Latency = 2.24ms | FPS = 446.96 | AVG_FPS = 505.21
Latency = 1.83ms | FPS = 545.71 | AVG_FPS = 507.46
Latency = 1.79ms | FPS = 559.54 | AVG_FPS = 510.20
Latency = 1.98ms | FPS = 505.16 | AVG_FPS = 509.95
Latency = 1.80ms | FPS = 555.91 | AVG_FPS = 512.14
Latency = 1.75ms | FPS = 570.65 | AVG_FPS = 514.80
Latency = 1.77ms | FPS = 565.12 | AVG_FPS = 516.98
Latency = 1.78ms | FPS = 562.47 | AVG_FPS = 518.88
Latency = 1.88ms | FPS = 532.20 | AVG_FPS = 519.41
Latency = 1.76ms | FPS = 568.72 | AVG_FPS = 521.31
Latency = 1.76ms | FPS = 568.64 | AVG_FPS = 523.06
Latency = 1.83ms | FPS = 545.78 | AVG_FPS = 523.87
Latency = 2.75ms | FPS = 364.06 | AVG_FPS = 518.36
Latency = 2.44ms | FPS = 410.24 | AVG_FPS = 514.76
Latency = 2.15ms | FPS = 465.00 | AVG_FPS = 513.15
Latency = 1.83ms | FPS = 547.34 | AVG_FPS = 514.22
Latency = 1.82ms | FPS = 550.07 | AVG_FPS = 515.31
Latency = 1.80ms | FPS = 556.86 | AVG_FPS = 516.53
Latency = 1.78ms | FPS = 560.66 | AVG_FPS = 517.79
Latency = 1.85ms | FPS = 540.85 | AVG_FPS = 518.43
Latency = 1.97ms | FPS = 506.93 | AVG_FPS = 518.12
Latency = 1.77ms | FPS = 563.52 | AVG_FPS = 519.31
Latency = 1.81ms | FPS = 551.74 | AVG_FPS = 520.15
Latency = 1.80ms | FPS = 555.76 | AVG_FPS = 521.04
Latency = 1.83ms | FPS = 547.77 | AVG_FPS = 521.69
Latency = 1.77ms | FPS = 563.67 | AVG_FPS = 522.69
Latency = 1.85ms | FPS = 541.62 | AVG_FPS = 523.13
Latency = 2.17ms | FPS = 460.20 | AVG_FPS = 521.70
Latency = 2.81ms | FPS = 355.30 | AVG_FPS = 518.00
Latency = 2.16ms | FPS = 462.28 | AVG_FPS = 516.79
Latency = 1.87ms | FPS = 535.81 | AVG_FPS = 517.19
Latency = 1.81ms | FPS = 552.83 | AVG_FPS = 517.94
Latency = 1.81ms | FPS = 551.59 | AVG_FPS = 518.62
Latency = 1.81ms | FPS = 551.88 | AVG_FPS = 519.29
Latency = 1.78ms | FPS = 560.59 | AVG_FPS = 520.10
Latency = 1.79ms | FPS = 558.94 | AVG_FPS = 520.85
Latency = 1.96ms | FPS = 510.63 | AVG_FPS = 520.65
Latency = 1.82ms | FPS = 550.22 | AVG_FPS = 521.20
Latency = 1.79ms | FPS = 559.54 | AVG_FPS = 521.90
Latency = 1.78ms | FPS = 560.59 | AVG_FPS = 522.59
Latency = 1.83ms | FPS = 547.70 | AVG_FPS = 523.03
Latency = 1.76ms | FPS = 566.72 | AVG_FPS = 523.78
Latency = 1.84ms | FPS = 543.02 | AVG_FPS = 524.11
Latency = 2.64ms | FPS = 378.51 | AVG_FPS = 521.68
Latency = 2.40ms | FPS = 417.14 | AVG_FPS = 519.97
Latency = 2.14ms | FPS = 467.49 | AVG_FPS = 519.12
Latency = 1.82ms | FPS = 550.22 | AVG_FPS = 519.61
Latency = 1.80ms | FPS = 554.44 | AVG_FPS = 520.16
Latency = 1.81ms | FPS = 552.61 | AVG_FPS = 520.66
Latency = 1.78ms | FPS = 561.79 | AVG_FPS = 521.28
Latency = 1.80ms | FPS = 556.79 | AVG_FPS = 521.81
Latency = 1.79ms | FPS = 559.54 | AVG_FPS = 522.37
Latency = 1.79ms | FPS = 559.84 | AVG_FPS = 522.91
Latency = 2.11ms | FPS = 474.42 | AVG_FPS = 522.22
Latency = 1.79ms | FPS = 558.05 | AVG_FPS = 522.72
Latency = 1.82ms | FPS = 550.36 | AVG_FPS = 523.10
Latency = 1.76ms | FPS = 568.95 | AVG_FPS = 523.73
Latency = 1.77ms | FPS = 564.59 | AVG_FPS = 524.28
Latency = 2.22ms | FPS = 450.42 | AVG_FPS = 523.30
Latency = 2.69ms | FPS = 371.44 | AVG_FPS = 521.30
Latency = 2.23ms | FPS = 448.88 | AVG_FPS = 520.36
Latency = 2.27ms | FPS = 439.79 | AVG_FPS = 519.33
Latency = 1.92ms | FPS = 520.90 | AVG_FPS = 519.35
Latency = 1.95ms | FPS = 512.44 | AVG_FPS = 519.26
Latency = 1.82ms | FPS = 548.63 | AVG_FPS = 519.62
Latency = 1.88ms | FPS = 530.66 | AVG_FPS = 519.76
Latency = 1.83ms | FPS = 546.70 | AVG_FPS = 520.08
Latency = 1.83ms | FPS = 547.56 | AVG_FPS = 520.41
Latency = 1.83ms | FPS = 545.35 | AVG_FPS = 520.70
Latency = 1.80ms | FPS = 556.20 | AVG_FPS = 521.12
Latency = 1.80ms | FPS = 555.39 | AVG_FPS = 521.51
Latency = 1.84ms | FPS = 544.86 | AVG_FPS = 521.78
Latency = 1.80ms | FPS = 556.86 | AVG_FPS = 522.17
Latency = 1.84ms | FPS = 543.73 | AVG_FPS = 522.41
Latency = 2.68ms | FPS = 373.19 | AVG_FPS = 520.77
Latency = 2.48ms | FPS = 402.60 | AVG_FPS = 519.49
Latency = 2.25ms | FPS = 444.92 | AVG_FPS = 518.68
Latency = 2.10ms | FPS = 475.81 | AVG_FPS = 518.23
Latency = 1.86ms | FPS = 536.22 | AVG_FPS = 518.42
Latency = 1.78ms | FPS = 561.56 | AVG_FPS = 518.87
Latency = 1.81ms | FPS = 551.74 | AVG_FPS = 519.21
Latency = 1.86ms | FPS = 538.77 | AVG_FPS = 519.41
Latency = 1.82ms | FPS = 550.22 | AVG_FPS = 519.72
Latency = 1.83ms | FPS = 547.56 | AVG_FPS = 519.99
[[108.28476079 188.27050212 286.63875325 355.7010887 ]
 [202.18680365 171.43988028 386.73458023 301.09193915]
 [429.43957671  76.93088856 286.18962092  93.58978557]
 [504.3033231   62.065803   146.39240185 126.5069848 ]] [0.81173893 0.3637773  0.79401759 0.74956349] [16  1  2  2]
Saved image with bounding boxes of detected objects to dog_out_tiny_416_INT8.png.
Latency = 1.83ms | FPS = 547.56 | post processing = 184.86ms | drawing = 8.53ms
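
A note on the AVG_FPS column: the script averages the per-iteration FPS values, which is an arithmetic mean of rates and tends to read somewhat higher than total frames divided by total time when the latency fluctuates. A minimal sketch of the difference, using two made-up latencies (the function names are illustrative and not part of the script):

```python
def mean_of_rates(latencies_ms, batch_size=1):
    """Average of per-iteration FPS values, the way the AVG_FPS column is computed."""
    rates = [batch_size / (ms / 1000.0) for ms in latencies_ms]
    return sum(rates) / len(rates)


def aggregate_fps(latencies_ms, batch_size=1):
    """Total frames processed divided by total elapsed time."""
    total_seconds = sum(latencies_ms) / 1000.0
    return batch_size * len(latencies_ms) / total_seconds


# Two hypothetical iterations: one fast (1 ms) and one slow (3 ms).
sample = [1.0, 3.0]
print("mean of rates: {:.2f} FPS".format(mean_of_rates(sample)))
print("aggregate:     {:.2f} FPS".format(aggregate_fps(sample)))
```

When the latencies are nearly constant the two numbers converge, so for the log above the gap should be small.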

OUTPUT Image:

IMPORTANT: With a batch size of 8 I got 700 FPS (latency = 11.32 ms), but with a batch size >= 16 the FPS decreases. I did not include the post-processing and box-drawing time in the inference latency; I only measured the latency of do_inference_v2().
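
For reference, the FPS figures quoted here follow from the batch size divided by the latency. A small sketch of that arithmetic (the helper names are hypothetical), which also shows the per-batch latency that a batch of 8 would need in order to reach 1000 FPS:

```python
def fps(batch_size, latency_ms):
    """Frames per second for one batch processed in latency_ms milliseconds."""
    return batch_size / (latency_ms / 1000.0)


def latency_budget_ms(batch_size, target_fps):
    """Largest per-batch latency (in ms) that still reaches target_fps."""
    return batch_size / target_fps * 1000.0


print("batch 8 @ 11.32 ms: {:.1f} FPS".format(fps(8, 11.32)))
print("batch 1 @ 1.83 ms:  {:.1f} FPS".format(fps(1, 1.83)))
print("budget for 1000 FPS at batch 8: {:.2f} ms".format(latency_budget_ms(8, 1000)))
```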

Did I do something wrong? How can I increase the FPS to 1000?

AastaLLL · May 19, 2021, 3:49am

Hi,

Have you also set the Xavier into performance mode?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

The performance table can be reproduced with the source below:

Thanks.

chakibdace · May 19, 2021, 8:01am

Hi,

Thanks for your answer. Yes, I am using mode 0 (MAXN) as the power mode and INT8 as the precision mode.
Am I getting this performance because I run the inference only on the GPU, instead of running it on the GPU and the two DLA cores simultaneously?

Thank you

AastaLLL · May 24, 2021, 5:26am

Hi,

Have you maximized the Xavier clocks?

$ sudo jetson_clocks

Thanks.

chakibdace · May 25, 2021, 9:23am

Hi @AastaLLL ,

Yes, I maximized the Xavier clocks. I tried the link you provided and was able to reproduce the benchmarks. So, if I understand correctly, we can reproduce the benchmarks with the trtexec executable, but we cannot get the best performance with the TensorRT Python API.

Could you suggest some documentation about trtexec, so that I can compare it with the TensorRT Python API?

Thank you

AastaLLL · June 8, 2021, 8:01am

Hi,

Please find an example in the below comment:

Thanks.
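
For anyone comparing the two approaches, a trtexec run can also be scripted from Python. The sketch below only assembles the command line and does not run it. The binary path is the usual JetPack location and the engine file names are placeholders, so both are assumptions to adjust. The flags shown (--onnx, --saveEngine, --int8, --useDLACore, --allowGPUFallback) are standard trtexec options, but it is worth confirming them with trtexec --help on your TensorRT version.

```python
import shlex

# Assumed default location of trtexec on JetPack; adjust if yours differs.
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_trtexec_cmd(onnx_path, engine_path, int8=True, dla_core=None):
    """Assemble a trtexec command line that builds an engine and reports its timing."""
    cmd = [TRTEXEC, "--onnx=" + onnx_path, "--saveEngine=" + engine_path]
    if int8:
        cmd.append("--int8")
    if dla_core is not None:
        # Target a DLA core and let unsupported layers fall back to the GPU.
        cmd += ["--useDLACore={}".format(dla_core), "--allowGPUFallback"]
    return cmd


# Placeholder engine names; the ONNX name follows the script earlier in this thread.
gpu_cmd = build_trtexec_cmd("yolov3-tiny-416.onnx", "tiny_gpu.trt")
dla_cmd = build_trtexec_cmd("yolov3-tiny-416.onnx", "tiny_dla0.trt", dla_core=0)
print(" ".join(shlex.quote(arg) for arg in gpu_cmd))
print(" ".join(shlex.quote(arg) for arg in dla_cmd))
# To actually launch one of them: subprocess.run(gpu_cmd, check=True)
```

Launching one GPU instance and one instance per DLA core at the same time is one way to exercise the GPU and both DLAs together, which may be closer to how the published figure was obtained.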
