TensorRT inference takes 1.5 sec per frame. I want to speed up my inference

Description

Here is my TensorRT inference script.
With this script, inferencing a single frame takes about 1.5 sec, which works out to roughly 0.7 fps. I want it to have a better fps.
I'm sharing my script below.

In particular, the line
context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])
takes about 1.4 sec to run.
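
For reference, this is a minimal sketch of how that number can be confirmed; it assumes the context and buffers from the script below are already set up, and relies on execute() being synchronous, so the wall-clock time includes the full GPU compute:

import time

# Time only the synchronous execute() call (hypothetical snippet; names match the script below).
start = time.perf_counter()
context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])
print(f"execute took {time.perf_counter() - start:.3f} s")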

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

def allocate_buffers(engine, batch_size, data_type):
    """
    This is the function to allocate buffers for input and output on the device.
    Args:
        engine : The deserialized TensorRT engine.
        batch_size : The batch size for execution.
        data_type : The data type of the input and output, for example trt.float32.

    Output:
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
    """
    # Determine dimensions and create page-locked memory buffers
    # (which won't be swapped to disk) to hold host inputs/outputs.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))

    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream

def load_images_to_buffer(pics, pagelocked_buffer):
    # Flatten the preprocessed image(s) and copy them into the page-locked host buffer.
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """
    This is the function to run the inference.
    Args:
        engine : The deserialized TensorRT engine.
        pics_1 : Input images to the model.
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
        batch_size : Batch size for execution.
        height: Height of the output image.
        width: Width of the output image.

    Output:
        The list of output images.
    """
    load_images_to_buffer(pics_1, h_input_1)

    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)

        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])

        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output.reshape((batch_size, -1, height, width))
        return out
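
For context, here is a minimal sketch of how the functions above are wired together; it assumes the same engine file as the trtexec run below and a frame already preprocessed to 1x3x720x1280 float32 (the random frame is only a placeholder):

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine from disk (file name taken from the trtexec run below).
with open("unethalf_engine.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

h_input_1, d_input_1, h_output, d_output, stream = allocate_buffers(engine, 1, trt.float32)

# Placeholder frame; in the real pipeline this is a preprocessed camera/video frame.
frame = np.random.rand(1, 3, 720, 1280).astype(np.float32)
out = do_inference(engine, frame, h_input_1, d_input_1, h_output, d_output, stream, 1, 720, 1280)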

Here is my trtexec output:

&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280
[03/13/2023-18:55:38] [I] === Model Options ===
[03/13/2023-18:55:38] [I] Format: *
[03/13/2023-18:55:38] [I] Model:
[03/13/2023-18:55:38] [I] Output:
[03/13/2023-18:55:38] [I] === Build Options ===
[03/13/2023-18:55:38] [I] Max batch: explicit batch
[03/13/2023-18:55:38] [I] Workspace: 16 MiB
[03/13/2023-18:55:38] [I] minTiming: 1
[03/13/2023-18:55:38] [I] avgTiming: 8
[03/13/2023-18:55:38] [I] Precision: FP32
[03/13/2023-18:55:38] [I] Calibration:
[03/13/2023-18:55:38] [I] Refit: Disabled
[03/13/2023-18:55:38] [I] Sparsity: Disabled
[03/13/2023-18:55:38] [I] Safe mode: Disabled
[03/13/2023-18:55:38] [I] DirectIO mode: Disabled
[03/13/2023-18:55:38] [I] Restricted mode: Disabled
[03/13/2023-18:55:38] [I] Save engine:
[03/13/2023-18:55:38] [I] Load engine: unethalf_engine.engine
[03/13/2023-18:55:38] [I] Profiling verbosity: 0
[03/13/2023-18:55:38] [I] Tactic sources: Using default tactic sources
[03/13/2023-18:55:38] [I] timingCacheMode: local
[03/13/2023-18:55:38] [I] timingCacheFile:
[03/13/2023-18:55:38] [I] Input(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Output(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Input build shape: modelInput=1+1+1
[03/13/2023-18:55:38] [I] Input calibration shapes: model
[03/13/2023-18:55:38] [I] === System Options ===
[03/13/2023-18:55:38] [I] Device: 0
[03/13/2023-18:55:38] [I] DLACore:
[03/13/2023-18:55:38] [I] Plugins:
[03/13/2023-18:55:38] [I] === Inference Options ===
[03/13/2023-18:55:38] [I] Batch: Explicit
[03/13/2023-18:55:38] [I] Input inference shape: modelInput=1
[03/13/2023-18:55:38] [I] Iterations: 10
[03/13/2023-18:55:38] [I] Duration: 3s (+ 200ms warm up)
[03/13/2023-18:55:38] [I] Sleep time: 0ms
[03/13/2023-18:55:38] [I] Idle time: 0ms
[03/13/2023-18:55:38] [I] Streams: 1
[03/13/2023-18:55:38] [I] ExposeDMA: Disabled
[03/13/2023-18:55:38] [I] Data transfers: Enabled
[03/13/2023-18:55:38] [I] Spin-wait: Disabled
[03/13/2023-18:55:38] [I] Multithreading: Disabled
[03/13/2023-18:55:38] [I] CUDA Graph: Disabled
[03/13/2023-18:55:38] [I] Separate profiling: Disabled
[03/13/2023-18:55:38] [I] Time Deserialize: Disabled
[03/13/2023-18:55:38] [I] Time Refit: Disabled
[03/13/2023-18:55:38] [I] Skip inference: Disabled
[03/13/2023-18:55:38] [I] Inputs:
[03/13/2023-18:55:38] [I] === Reporting Options ===
[03/13/2023-18:55:38] [I] Verbose: Disabled
[03/13/2023-18:55:38] [I] Averages: 10 inferences
[03/13/2023-18:55:38] [I] Percentile: 99
[03/13/2023-18:55:38] [I] Dump refittable layers:Disabled
[03/13/2023-18:55:38] [I] Dump output: Disabled
[03/13/2023-18:55:38] [I] Profile: Disabled
[03/13/2023-18:55:38] [I] Export timing to JSON file:
[03/13/2023-18:55:38] [I] Export output to JSON file:
[03/13/2023-18:55:38] [I] Export profile to JSON file:
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] === Device Information ===
[03/13/2023-18:55:38] [I] Selected Device: NVIDIA Tegra X1
[03/13/2023-18:55:38] [I] Compute Capability: 5.3
[03/13/2023-18:55:38] [I] SMs: 1
[03/13/2023-18:55:38] [I] Compute Clock Rate: 0.9216 GHz
[03/13/2023-18:55:38] [I] Device Global Memory: 3964 MiB
[03/13/2023-18:55:38] [I] Shared Memory per SM: 64 KiB
[03/13/2023-18:55:38] [I] Memory Bus Width: 64 bits (ECC disabled)
[03/13/2023-18:55:38] [I] Memory Clock Rate: 0.01275 GHz
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] TensorRT version: 8.2.1
[03/13/2023-18:55:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +229, GPU +0, now: CPU 289, GPU 2525 (MiB)
[03/13/2023-18:55:40] [I] [TRT] Loaded engine size: 41 MiB
[03/13/2023-18:55:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +160, now: CPU 448, GPU 2686 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +240, GPU +243, now: CPU 688, GPU 2929 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[03/13/2023-18:55:42] [I] Engine loaded in 4.66617 sec.
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +192, now: CPU 0, GPU 233 (MiB)
[03/13/2023-18:55:42] [I] Using random values for input modelInput
[03/13/2023-18:55:43] [I] Created input binding for modelInput with dimensions 1x3x720x1280
[03/13/2023-18:55:43] [I] Using random values for output modelOutput
[03/13/2023-18:55:43] [I] Created output binding for modelOutput with dimensions 1x1x720x1280
[03/13/2023-18:55:43] [I] Starting inference
[03/13/2023-18:56:00] [I] Warmup completed 1 queries over 200 ms
[03/13/2023-18:56:00] [I] Timing trace has 10 queries over 14.5163 s
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Trace details ===
[03/13/2023-18:56:00] [I] Trace averages of 10 runs:
[03/13/2023-18:56:00] [I] Average on 10 runs - GPU latency: 1450.18 ms - Host latency: 1451.62 ms (end to end 1451.63 ms, enqueue 13.3297 ms)
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Performance summary ===
[03/13/2023-18:56:00] [I] Throughput: 0.68888 qps
[03/13/2023-18:56:00] [I] Latency: min = 1445.93 ms, max = 1456.88 ms, mean = 1451.62 ms, median = 1451.2 ms, percentile(99%) = 1456.88 ms
[03/13/2023-18:56:00] [I] End-to-End Host Latency: min = 1445.94 ms, max = 1456.9 ms, mean = 1451.63 ms, median = 1451.22 ms, percentile(99%) = 1456.9 ms
[03/13/2023-18:56:00] [I] Enqueue Time: min = 2.87402 ms, max = 17.4658 ms, mean = 13.3297 ms, median = 13.9668 ms, percentile(99%) = 17.4658 ms
[03/13/2023-18:56:00] [I] H2D Latency: min = 1.07715 ms, max = 1.08496 ms, mean = 1.08086 ms, median = 1.08057 ms, percentile(99%) = 1.08496 ms
[03/13/2023-18:56:00] [I] GPU Compute Time: min = 1444.49 ms, max = 1455.45 ms, mean = 1450.18 ms, median = 1449.76 ms, percentile(99%) = 1455.45 ms
[03/13/2023-18:56:00] [I] D2H Latency: min = 0.355469 ms, max = 0.358398 ms, mean = 0.357129 ms, median = 0.357422 ms, percentile(99%) = 0.358398 ms
[03/13/2023-18:56:00] [I] Total Host Walltime: 14.5163 s
[03/13/2023-18:56:00] [I] Total GPU Compute Time: 14.5018 s
[03/13/2023-18:56:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/13/2023-18:56:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280

Environment

TensorRT Version: 8.2.1
GPU Type: Jetson Nano
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Hi,
Can you try running your model with the trtexec command and share the --verbose log if the issue still persists?
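
For example, something along these lines (adjust the engine path and the input tensor name to match your model):

trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280 --verbose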

You can refer to the link below for the full list of supported operators; if any operator is not supported, you will need to create a custom plugin to support that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:

Thanks!