Description
Here is my TensorRT inference script. With this script, inference on one frame takes about 1.5 s, which is roughly 0.7 fps. I want to reach a better fps. I'm sharing my script below. In particular, this line:

context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])

takes about 1.4 s to run.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

def allocate_buffers(engine, batch_size, data_type):
    """
    Allocate input and output buffers on the host and the device.
    Args:
        engine: The deserialized TensorRT engine.
        batch_size: The batch size used at execution time.
        data_type: The data type of the input and output, for example trt.float32.
    Returns:
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
    """
    # Determine dimensions and create page-locked memory buffers
    # (which won't be swapped to disk) to hold host inputs/outputs.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream

def load_images_to_buffer(pics, pagelocked_buffer):
    # Flatten the input and copy it into the page-locked host buffer.
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """
    Run inference on a batch of images.
    Args:
        engine: The deserialized TensorRT engine.
        pics_1: Input images for the model.
        h_input_1: Input buffer on the host.
        d_input_1: Input buffer on the device.
        h_output: Output buffer on the host.
        d_output: Output buffer on the device.
        stream: CUDA stream.
        batch_size: Batch size used at execution time.
        height: Height of the output image.
        width: Width of the output image.
    Returns:
        The batch of output images.
    """
    load_images_to_buffer(pics_1, h_input_1)
    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)
        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output.reshape((batch_size, -1, height, width))
        return out
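As an aside on the Python side of things: the script above recreates the execution context and attaches a trt.Profiler on every call, and attaching a profiler forces synchronous execution. Since the trtexec log further down shows the GPU compute itself is ~1450 ms, restructuring the Python only removes the smaller overheads, but it is still good practice. A minimal sketch of a reusable inference path, assuming the allocate_buffers function above, a 1x3x720x1280 FP32 engine, and a hypothetical engine path:

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

ENGINE_PATH = "unethalf_engine.engine"  # assumption: your engine file

def load_engine(path):
    logger = trt.Logger(trt.Logger.WARNING)
    with open(path, "rb") as f, trt.Runtime(logger) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine(ENGINE_PATH)
# Create the execution context ONCE, outside the per-frame loop.
context = engine.create_execution_context()
h_input, d_input, h_output, d_output, stream = allocate_buffers(engine, 1, trt.float32)

def infer(frame):
    # Copy the frame into the page-locked host buffer.
    np.copyto(h_input, np.asarray(frame, dtype=np.float32).ravel())
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Explicit-batch, fully asynchronous execution on the same stream
    # as the copies. No profiler attached, so nothing forces extra syncs.
    context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                             stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    return h_output.reshape((1, -1, 720, 1280))
```

Note that since your engine was built with explicit batch (the trtexec log says "Max batch: explicit batch"), execute_async_v2 is the matching API; execute(batch_size=...) belongs to the older implicit-batch path.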
here is my trtexec output
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280
[03/13/2023-18:55:38] [I] === Model Options ===
[03/13/2023-18:55:38] [I] Format: *
[03/13/2023-18:55:38] [I] Model:
[03/13/2023-18:55:38] [I] Output:
[03/13/2023-18:55:38] [I] === Build Options ===
[03/13/2023-18:55:38] [I] Max batch: explicit batch
[03/13/2023-18:55:38] [I] Workspace: 16 MiB
[03/13/2023-18:55:38] [I] minTiming: 1
[03/13/2023-18:55:38] [I] avgTiming: 8
[03/13/2023-18:55:38] [I] Precision: FP32
[03/13/2023-18:55:38] [I] Calibration:
[03/13/2023-18:55:38] [I] Refit: Disabled
[03/13/2023-18:55:38] [I] Sparsity: Disabled
[03/13/2023-18:55:38] [I] Safe mode: Disabled
[03/13/2023-18:55:38] [I] DirectIO mode: Disabled
[03/13/2023-18:55:38] [I] Restricted mode: Disabled
[03/13/2023-18:55:38] [I] Save engine:
[03/13/2023-18:55:38] [I] Load engine: unethalf_engine.engine
[03/13/2023-18:55:38] [I] Profiling verbosity: 0
[03/13/2023-18:55:38] [I] Tactic sources: Using default tactic sources
[03/13/2023-18:55:38] [I] timingCacheMode: local
[03/13/2023-18:55:38] [I] timingCacheFile:
[03/13/2023-18:55:38] [I] Input(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Output(s)s format: fp32:CHW
[03/13/2023-18:55:38] [I] Input build shape: modelInput=1+1+1
[03/13/2023-18:55:38] [I] Input calibration shapes: model
[03/13/2023-18:55:38] [I] === System Options ===
[03/13/2023-18:55:38] [I] Device: 0
[03/13/2023-18:55:38] [I] DLACore:
[03/13/2023-18:55:38] [I] Plugins:
[03/13/2023-18:55:38] [I] === Inference Options ===
[03/13/2023-18:55:38] [I] Batch: Explicit
[03/13/2023-18:55:38] [I] Input inference shape: modelInput=1
[03/13/2023-18:55:38] [I] Iterations: 10
[03/13/2023-18:55:38] [I] Duration: 3s (+ 200ms warm up)
[03/13/2023-18:55:38] [I] Sleep time: 0ms
[03/13/2023-18:55:38] [I] Idle time: 0ms
[03/13/2023-18:55:38] [I] Streams: 1
[03/13/2023-18:55:38] [I] ExposeDMA: Disabled
[03/13/2023-18:55:38] [I] Data transfers: Enabled
[03/13/2023-18:55:38] [I] Spin-wait: Disabled
[03/13/2023-18:55:38] [I] Multithreading: Disabled
[03/13/2023-18:55:38] [I] CUDA Graph: Disabled
[03/13/2023-18:55:38] [I] Separate profiling: Disabled
[03/13/2023-18:55:38] [I] Time Deserialize: Disabled
[03/13/2023-18:55:38] [I] Time Refit: Disabled
[03/13/2023-18:55:38] [I] Skip inference: Disabled
[03/13/2023-18:55:38] [I] Inputs:
[03/13/2023-18:55:38] [I] === Reporting Options ===
[03/13/2023-18:55:38] [I] Verbose: Disabled
[03/13/2023-18:55:38] [I] Averages: 10 inferences
[03/13/2023-18:55:38] [I] Percentile: 99
[03/13/2023-18:55:38] [I] Dump refittable layers:Disabled
[03/13/2023-18:55:38] [I] Dump output: Disabled
[03/13/2023-18:55:38] [I] Profile: Disabled
[03/13/2023-18:55:38] [I] Export timing to JSON file:
[03/13/2023-18:55:38] [I] Export output to JSON file:
[03/13/2023-18:55:38] [I] Export profile to JSON file:
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] === Device Information ===
[03/13/2023-18:55:38] [I] Selected Device: NVIDIA Tegra X1
[03/13/2023-18:55:38] [I] Compute Capability: 5.3
[03/13/2023-18:55:38] [I] SMs: 1
[03/13/2023-18:55:38] [I] Compute Clock Rate: 0.9216 GHz
[03/13/2023-18:55:38] [I] Device Global Memory: 3964 MiB
[03/13/2023-18:55:38] [I] Shared Memory per SM: 64 KiB
[03/13/2023-18:55:38] [I] Memory Bus Width: 64 bits (ECC disabled)
[03/13/2023-18:55:38] [I] Memory Clock Rate: 0.01275 GHz
[03/13/2023-18:55:38] [I]
[03/13/2023-18:55:38] [I] TensorRT version: 8.2.1
[03/13/2023-18:55:40] [I] [TRT] [MemUsageChange] Init CUDA: CPU +229, GPU +0, now: CPU 289, GPU 2525 (MiB)
[03/13/2023-18:55:40] [I] [TRT] Loaded engine size: 41 MiB
[03/13/2023-18:55:41] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +160, now: CPU 448, GPU 2686 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +240, GPU +243, now: CPU 688, GPU 2929 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +41, now: CPU 0, GPU 41 (MiB)
[03/13/2023-18:55:42] [I] Engine loaded in 4.66617 sec.
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 647, GPU 2887 (MiB)
[03/13/2023-18:55:42] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +192, now: CPU 0, GPU 233 (MiB)
[03/13/2023-18:55:42] [I] Using random values for input modelInput
[03/13/2023-18:55:43] [I] Created input binding for modelInput with dimensions 1x3x720x1280
[03/13/2023-18:55:43] [I] Using random values for output modelOutput
[03/13/2023-18:55:43] [I] Created output binding for modelOutput with dimensions 1x1x720x1280
[03/13/2023-18:55:43] [I] Starting inference
[03/13/2023-18:56:00] [I] Warmup completed 1 queries over 200 ms
[03/13/2023-18:56:00] [I] Timing trace has 10 queries over 14.5163 s
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Trace details ===
[03/13/2023-18:56:00] [I] Trace averages of 10 runs:
[03/13/2023-18:56:00] [I] Average on 10 runs - GPU latency: 1450.18 ms - Host latency: 1451.62 ms (end to end 1451.63 ms, enqueue 13.3297 ms)
[03/13/2023-18:56:00] [I]
[03/13/2023-18:56:00] [I] === Performance summary ===
[03/13/2023-18:56:00] [I] Throughput: 0.68888 qps
[03/13/2023-18:56:00] [I] Latency: min = 1445.93 ms, max = 1456.88 ms, mean = 1451.62 ms, median = 1451.2 ms, percentile(99%) = 1456.88 ms
[03/13/2023-18:56:00] [I] End-to-End Host Latency: min = 1445.94 ms, max = 1456.9 ms, mean = 1451.63 ms, median = 1451.22 ms, percentile(99%) = 1456.9 ms
[03/13/2023-18:56:00] [I] Enqueue Time: min = 2.87402 ms, max = 17.4658 ms, mean = 13.3297 ms, median = 13.9668 ms, percentile(99%) = 17.4658 ms
[03/13/2023-18:56:00] [I] H2D Latency: min = 1.07715 ms, max = 1.08496 ms, mean = 1.08086 ms, median = 1.08057 ms, percentile(99%) = 1.08496 ms
[03/13/2023-18:56:00] [I] GPU Compute Time: min = 1444.49 ms, max = 1455.45 ms, mean = 1450.18 ms, median = 1449.76 ms, percentile(99%) = 1455.45 ms
[03/13/2023-18:56:00] [I] D2H Latency: min = 0.355469 ms, max = 0.358398 ms, mean = 0.357129 ms, median = 0.357422 ms, percentile(99%) = 0.358398 ms
[03/13/2023-18:56:00] [I] Total Host Walltime: 14.5163 s
[03/13/2023-18:56:00] [I] Total GPU Compute Time: 14.5018 s
[03/13/2023-18:56:00] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/13/2023-18:56:00] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # trtexec --loadEngine=unethalf_engine.engine --shapes=modelInput:1x3x720x1280
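The log shows the engine runs in FP32 ("Precision: FP32") with GPU compute time of ~1450 ms, so the model itself is the bottleneck, not the host code. The Tegra X1 on the Jetson Nano has fast FP16 paths, so rebuilding the engine with FP16 enabled is the usual first step. A sketch of the rebuild, assuming the original ONNX model is available (the file names here are hypothetical placeholders):

```shell
# Build an FP16 engine from the source ONNX model.
# --fp16 lets TensorRT pick FP16 kernels where they are faster;
# --workspace (MiB) gives the builder room to try more tactics
# than the 16 MiB default shown in the log above.
trtexec --onnx=unethalf.onnx \
        --saveEngine=unethalf_engine_fp16.engine \
        --fp16 \
        --workspace=2048

# Re-benchmark the FP16 engine the same way as before.
trtexec --loadEngine=unethalf_engine_fp16.engine \
        --shapes=modelInput:1x3x720x1280
```

If FP16 alone is not enough, reducing the input resolution below 720x1280 is the other lever, since compute scales roughly with pixel count.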
Environment
TensorRT Version: 8.2.x
GPU Type: Jetson Nano
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):