Description
Here is my TensorRT inference script. With this script, inference on one frame takes about 1.5 s, which is roughly 0.7 fps. I want to reach a better fps. I'm sharing my script below. In particular, this line:

context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])

takes about 1.4 s to run.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

def allocate_buffers(engine, batch_size, data_type):
    """
    Allocate input and output buffers on the host and the device.
    Args:
        engine: The deserialized TensorRT engine.
        batch_size: The batch size used at execution time.
        data_type: The data type of the input and output, for example trt.float32.
    Returns:
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
    """
    # Determine dimensions and create page-locked memory buffers
    # (which won't be swapped to disk) to hold host inputs/outputs.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream

def load_images_to_buffer(pics, pagelocked_buffer):
    # Flatten the input and copy it into the page-locked host buffer.
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """
    Run inference on a batch of images.
    Args:
        engine: The deserialized TensorRT engine.
        pics_1: Input images for the model.
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
        batch_size: Batch size used at execution time.
        height: Height of the output image.
        width: Width of the output image.
    Returns:
        The batch of output images.
    """
    load_images_to_buffer(pics_1, h_input_1)
    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)
        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output.reshape((batch_size, -1, height, width))
        return out
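As an aside on the Python side of things: the script above recreates the execution context and attaches a trt.Profiler on every call, and attaching a profiler forces synchronous execution. Since the trtexec log further down shows the GPU compute itself is ~1450 ms, restructuring the Python only removes the smaller overheads, but it is still good practice. A minimal sketch of a reusable inference path, assuming the allocate_buffers function above, a 1x3x720x1280 FP32 engine, and a hypothetical engine path:

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

ENGINE_PATH = "unethalf_engine.engine"  # assumption: your engine file

def load_engine(path):
    logger = trt.Logger(trt.Logger.WARNING)
    with open(path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine(ENGINE_PATH)
# Create the execution context ONCE, outside the per-frame loop.
context = engine.create_execution_context()
h_input, d_input, h_output, d_output, stream = allocate_buffers(engine, 1, trt.float32)

def infer(frame):
    # Copy the frame into the page-locked host buffer.
    np.copyto(h_input, np.asarray(frame, dtype=np.float32).ravel())
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Explicit-batch, fully asynchronous execution on the same stream
    # as the copies. No profiler attached, so nothing forces extra syncs.
    context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                             stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    return h_output.reshape((1, -1, 720, 1280))
```

Note that since your engine was built with explicit batch (the trtexec log says "Max batch: explicit batch"), execute_async_v2 is the matching API; execute(batch_size=...) belongs to the older implicit-batch path.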
here is my trtexec output
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280
[03/13/2023-18:55:38] [I] === Model Options ===
[03/13/2023-18:55:38] [I] Format: *
[03/13/2023-18:55:38] [I] Model:
[03/13/2023-18:55:38] [I] Output:
[03/13/2023-18:55:38] [I] === Build Options ===
[03/13/2023-18:55:38] [I] Max batch: explicit batch
[03/13/2023-18:55:38] [I] Workspace: 16 MiB
[03/13/2023-18:55:38] [I] minTiming: 1
[03/13/2023-18:55:38] [I] avgTiming: 8
[03/13/2023-18:55:38] [I] Precision: FP32
[03/13/2023-18:55:38] [I] Calibration:
[03/13/2023-18:55:38] [I] Refit: Disabled
[03/13/2023-18:55:38] [I] Sparsity: Disabled
[03/13/2023-18:55:38] [I] Safe mode: Disabled
[03/13/2023-18:55:38] [I] DirectIO mode: Disabled
[03/13/2023-18:55:38] [I] Restricted mode: Disabled
[03/13/2023-18:55:38] [I] Save engine:
[03/13/2023-18:55:38] [I] Load engine: unethalf_engine.engine
[03/13/2023-18:55:38] [I] Profiling verbosity: 0
[03/13/2023-18:55:38] [I] Tactic sources: Using default tactic sources
[03/13/2023-18:55:38] [I] timingCacheMode: local
[03/13/2023-18:55:38] [I] timingCacheFile:
[03/13/2023-18:55:38] [I] Input(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Output(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Input build shape: modelInput=1+1+1
[03/13/2023-18:55:38] [I] Input calibration shapes: model
[03/13/2023-18:55:38] [I] === System Options ===
[03/13/2023-18:55:38] [I] Device: 0
[03/13/2023-18:55:38] [I] DLACore:
[03/13/2023-18:55:38] [I] Plugins:
[03/13/2023-18:55:38] [I] === Inference Options ===
[03/13/2023-18:55:38] [I] Batch: Explicit
[03/13/2023-18:55:38] [I] Input inference shape: modelInput=1
[03/13/2023-18:55:38] [I] Iterations: 10
[03/13/2023-18:55:38] [I] Duration: 3s (+ 200ms warm up)
[03/13/2023-18:55:38] [I] Sleep time: 0ms
[03/13/2023-18:55:38] [I] Idle time: 0ms
[03/13/2023-18:55:38] [I] Streams: 1
[03/13/2023-18:55:38] [I] ExposeDMA: Disabled
[03/13/2023-18:55:38] [I] Data transfers: Enabled
[03/13/2023-18:55:38] [I] Spin-wait: Disabled
[03/13/2023-18:55:38] [I] Multithreading: Disabled
[03/13/2023-18:55:38] [I] CUDA Graph: Disabled
[03/13/2023-18:55:38] [I] Separate profiling: Disabled
[03/13/2023-18:55:38] [I] Time Deserialize: Disabled
[03/13/2023-18:55:38] [I] Time Refit: Disabled
[03/13/2023-18:55:38] [I] Skip inference: Disabled
[03/13/2023-18:55:38] [I] Inputs:
[03/13/2023-18:55:38] [I] === Reporting Options ===
[03/13/2023-18:55:38] [I] Verbose: Disabled
[03/13/2023-18:55:38] [I] Averages: 10 inferences
[03/13/2023-18:55:38] [I] Percentile: 99
[03/13/2023-18:55:38] [I] Dump refittable layers:Disabled
[03/13/2023-18:55:38] [I] Dump output: Disabled
[03/13/2023-18:55:38] [I] Profile: Disabled
[03/13/2023-18:55:38] [I] Export timing to JSON file:
[03/13/2023-18:55:38] [I] Export output to JSON file:
[03/13/2023-18:55:38] [I] Export profile to JSON file:
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] === Device Information ===
[03/13/2023-18:55:38] [I] Selected Device: NVIDIA Tegra X1
[03/13/2023-18:55:38] [I] Compute Capability: 5.3
[03/13/2023-18:55:38] [I] SMs: 1
[03/13/2023-18:55:38] [I] Compute Clock Rate: 0.9216 GHz
[03/13/2023-18:55:38] [I] Device Global Memory: 3964 MiB
[03/13/2023-18:55:38] [I] Shared Memory per SM: 64 KiB
[03/13/2023-18:55:38] [I] Memory Bus Width: 64 bits (ECC disabled)
[03/13/2023-18:55:38] [I] Memory Clock Rate: 0.01275 GHz
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] TensorRT version: 8.2.1
[03/13/2023-18:55:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +229, GPU +0, now: CPU 289, GPU 2525 (MiB)
[03/13/2023-18:55:40] [I] [TRT] Loaded engine size: 41 MiB
[03/13/2023-18:55:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +160, now: CPU 448, GPU 2686 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +240, GPU +243, now: CPU 688, GPU 2929 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[03/13/2023-18:55:42] [I] Engine loaded in 4.66617 sec.
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +192, now: CPU 0, GPU 233 (MiB)
[03/13/2023-18:55:42] [I] Using random values for input modelInput
[03/13/2023-18:55:43] [I] Created input binding for modelInput with dimensions 1x3x720x1280
[03/13/2023-18:55:43] [I] Using random values for output modelOutput
[03/13/2023-18:55:43] [I] Created output binding for modelOutput with dimensions 1x1x720x1280
[03/13/2023-18:55:43] [I] Starting inference
[03/13/2023-18:56:00] [I] Warmup completed 1 queries over 200 ms
[03/13/2023-18:56:00] [I] Timing trace has 10 queries over 14.5163 s
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Trace details ===
[03/13/2023-18:56:00] [I] Trace averages of 10 runs:
[03/13/2023-18:56:00] [I] Average on 10 runs - GPU latency: 1450.18 ms - Host latency: 1451.62 ms (end to end 1451.63 ms, enqueue 13.3297 ms)
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Performance summary ===
[03/13/2023-18:56:00] [I] Throughput: 0.68888 qps
[03/13/2023-18:56:00] [I] Latency: min = 1445.93 ms, max = 1456.88 ms, mean = 1451.62 ms, median = 1451.2 ms, percentile(99%) = 1456.88 ms
[03/13/2023-18:56:00] [I] End-to-End Host Latency: min = 1445.94 ms, max = 1456.9 ms, mean = 1451.63 ms, median = 1451.22 ms, percentile(99%) = 1456.9 ms
[03/13/2023-18:56:00] [I] Enqueue Time: min = 2.87402 ms, max = 17.4658 ms, mean = 13.3297 ms, median = 13.9668 ms, percentile(99%) = 17.4658 ms
[03/13/2023-18:56:00] [I] H2D Latency: min = 1.07715 ms, max = 1.08496 ms, mean = 1.08086 ms, median = 1.08057 ms, percentile(99%) = 1.08496 ms
[03/13/2023-18:56:00] [I] GPU Compute Time: min = 1444.49 ms, max = 1455.45 ms, mean = 1450.18 ms, median = 1449.76 ms, percentile(99%) = 1455.45 ms
[03/13/2023-18:56:00] [I] D2H Latency: min = 0.355469 ms, max = 0.358398 ms, mean = 0.357129 ms, median = 0.357422 ms, percentile(99%) = 0.358398 ms
[03/13/2023-18:56:00] [I] Total Host Walltime: 14.5163 s
[03/13/2023-18:56:00] [I] Total GPU Compute Time: 14.5018 s
[03/13/2023-18:56:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/13/2023-18:56:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280
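The log shows the engine runs in FP32 ("Precision: FP32") with GPU compute time of ~1450 ms, so the model itself is the bottleneck, not the host code. The Tegra X1 on the Jetson Nano has fast FP16 paths, so rebuilding the engine with FP16 enabled is the usual first step. A sketch of the rebuild, assuming the original ONNX model is available (the file names here are hypothetical placeholders):

```shell
# Build an FP16 engine from the source ONNX model.
# --fp16 lets TensorRT pick FP16 kernels where they are faster;
# --workspace (MiB) gives the builder room to try more tactics
# than the 16 MiB default shown in the log above.
trtexec --onnx=unethalf.onnx \
        --saveEngine=unethalf_engine_fp16.engine \
        --fp16 \
        --workspace=2048

# Re-benchmark the FP16 engine the same way as before.
trtexec --loadEngine=unethalf_engine_fp16.engine \
        --shapes=modelInput:1x3x720x1280
```

If FP16 alone is not enough, reducing the input resolution below 720x1280 is the other lever, since compute scales roughly with pixel count.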
Environment
TensorRT Version: 8.2.x
GPU Type: Jetson Nano
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):