TensorRT does not speed up the model as expected

Description

I tried TensorRT with my model. With FP16 the model is about 20% faster than PyTorch, but with FP32 it is about 20% slower than PyTorch. It looks like TensorRT does not work well with my hardware.
Is it possible that my Ubuntu version is too new, and that I have to install dependencies such as cuDNN built for Ubuntu 20.04?

Environment

TensorRT Version: 8.4.0.6
GPU Type: 3090
Nvidia Driver Version: 470.74
CUDA Version: 11.4
CUDNN Version: 8.4.0
CUDA Toolkit: 11.3
Operating System + Version: Ubuntu 21.10
Python Version (if applicable): 3.7.4
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

We request that you share the model, script, profiler, and performance output, if not already shared, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
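For example, to build and time an FP16 engine in one shot (the paths here are placeholders):

trtexec --onnx=model.onnx --saveEngine=model.trt --fp16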

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead; a rough timing sketch follows the links below.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
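As a rough illustration of timing only the network inference in PyTorch, a minimal sketch could look like the following, where model and dummy_input are hypothetical placeholders for your network and a representative input:

import time
import torch

model = model.eval().cuda()          # hypothetical: your network
dummy_input = dummy_input.cuda()     # hypothetical: a representative input

with torch.no_grad():
    for _ in range(10):              # warm-up, excluded from the measurement
        model(dummy_input)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(dummy_input)           # inference only: no pre/post-processing
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    print(f"mean latency: {(time.perf_counter() - start) / 100 * 1000:.2f} ms")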

Thanks!

Thanks for your reply. That's actually another problem: I have installed TensorRT, but I cannot find the trtexec tool, nor the samples directory.


I'm not totally sure whether I installed it correctly, or whether this tool has been removed from the newest version of TensorRT.

Another problem is that the inference results are also wrong whenever I set certain configuration flags via config.set_flag(trt.BuilderFlag.xxx), e.g. trt.BuilderFlag.FP16 or trt.BuilderFlag.GPU_FALLBACK. That could be an FP16 precision problem, but the results stay wrong even when I manually set the precision of all layers (a sketch of what I mean is shown below).
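A sketch of what such a per-layer override can look like (assuming network and config are created as in the script further down; the OBEY_PRECISION_CONSTRAINTS flag is what makes the builder honor the per-layer settings):

# force every layer to run in FP32 even with the FP16 flag set
for i in range(network.num_layers):
    layer = network.get_layer(i)
    layer.precision = trt.float32
    for j in range(layer.num_outputs):
        layer.set_output_type(j, trt.float32)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)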
Below is the script I used to generate the TRT model. I have also tried generating the engine with the polygraphy API (see the sketch after the script), and the problems remain.
The model and the network should be correct, because they run well on another machine with a 2080 GPU. We have also tried the code on another 3090 laptop, and the prediction is incorrect there as well.

import tensorrt as trt
import polygraphy  # unused here; the polygraphy variant is sketched below

model_path = './models/nnUNet_model_best.onnx'

logger = trt.Logger(trt.Logger.WARNING)  # default logger (replaced below)

class MyLogger(trt.ILogger):
    def __init__(self):
        trt.ILogger.__init__(self)

    def log(self, severity, msg):
        pass  # your custom logging implementation here

logger = MyLogger()

builder = trt.Builder(logger)

print("If the platform has fast int8:", builder.platform_has_fast_int8)

# create a network definition with an explicit batch dimension
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

success = parser.parse_from_file(model_path)
for idx in range(parser.num_errors):
    print(parser.get_error(idx))

if not success:
    pass  # error handling code here

config = builder.create_builder_config()
config.reset()
config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)

config.set_flag(trt.BuilderFlag.FP16)

serialized_engine = builder.build_serialized_network(network, config)

with open("./models/nnUNet_model_best_fp16.trt", "wb") as f:
    f.write(serialized_engine)
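The polygraphy-API variant mentioned above was roughly equivalent to the following sketch (same paths as the script; polygraphy's loaders are lazy, so calling the outermost one triggers the build):

from polygraphy.backend.trt import (
    CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, SaveEngine,
)

# build an FP16 engine from the same ONNX file and save it
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("./models/nnUNet_model_best.onnx"),
    config=CreateConfig(fp16=True),
)
SaveEngine(build_engine, path="./models/nnUNet_model_best_fp16.trt")()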

trtexec should be located in /usr/src/tensorrt/bin or /opt/tensorrt/bin.
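If the binary is not on your PATH, pointing at it directly or extending PATH should work (assuming the standard package layout):

export PATH=$PATH:/usr/src/tensorrt/bin
trtexec --help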

Also, please share the repro ONNX model and script with us so that we can try it on our end for better debugging.

Thank you.

Thanks for your reply. I have found the trtexec tool and used it to build an engine, but the prediction is still wrong when using FP16. The repro command and its log follow.
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16
[06/20/2022-11:15:49] [W] --explicitBatch flag has been deprecated and has no effect!
[06/20/2022-11:15:49] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[06/20/2022-11:15:49] [I] === Model Options ===
[06/20/2022-11:15:49] [I] Format: ONNX
[06/20/2022-11:15:49] [I] Model: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:49] [I] Output:
[06/20/2022-11:15:49] [I] === Build Options ===
[06/20/2022-11:15:49] [I] Max batch: explicit batch
[06/20/2022-11:15:49] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/20/2022-11:15:49] [I] minTiming: 1
[06/20/2022-11:15:49] [I] avgTiming: 8
[06/20/2022-11:15:49] [I] Precision: FP32+FP16
[06/20/2022-11:15:49] [I] LayerPrecisions:
[06/20/2022-11:15:49] [I] Calibration:
[06/20/2022-11:15:49] [I] Refit: Disabled
[06/20/2022-11:15:49] [I] Sparsity: Disabled
[06/20/2022-11:15:49] [I] Safe mode: Disabled
[06/20/2022-11:15:49] [I] DirectIO mode: Disabled
[06/20/2022-11:15:49] [I] Restricted mode: Disabled
[06/20/2022-11:15:49] [I] Save engine: /home/hurwa/trtexec_test/nnUNet_model_best.trt
[06/20/2022-11:15:49] [I] Load engine:
[06/20/2022-11:15:49] [I] Profiling verbosity: 0
[06/20/2022-11:15:49] [I] Tactic sources: Using default tactic sources
[06/20/2022-11:15:49] [I] timingCacheMode: local
[06/20/2022-11:15:49] [I] timingCacheFile:
[06/20/2022-11:15:49] [I] Input(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Output(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Input build shapes: model
[06/20/2022-11:15:49] [I] Input calibration shapes: model
[06/20/2022-11:15:49] [I] === System Options ===
[06/20/2022-11:15:49] [I] Device: 0
[06/20/2022-11:15:49] [I] DLACore:
[06/20/2022-11:15:49] [I] Plugins:
[06/20/2022-11:15:49] [I] === Inference Options ===
[06/20/2022-11:15:49] [I] Batch: Explicit
[06/20/2022-11:15:49] [I] Input inference shapes: model
[06/20/2022-11:15:49] [I] Iterations: 10
[06/20/2022-11:15:49] [I] Duration: 3s (+ 200ms warm up)
[06/20/2022-11:15:49] [I] Sleep time: 0ms
[06/20/2022-11:15:49] [I] Idle time: 0ms
[06/20/2022-11:15:49] [I] Streams: 1
[06/20/2022-11:15:49] [I] ExposeDMA: Disabled
[06/20/2022-11:15:49] [I] Data transfers: Enabled
[06/20/2022-11:15:49] [I] Spin-wait: Disabled
[06/20/2022-11:15:49] [I] Multithreading: Disabled
[06/20/2022-11:15:49] [I] CUDA Graph: Disabled
[06/20/2022-11:15:49] [I] Separate profiling: Disabled
[06/20/2022-11:15:49] [I] Time Deserialize: Disabled
[06/20/2022-11:15:49] [I] Time Refit: Disabled
[06/20/2022-11:15:49] [I] Skip inference: Disabled
[06/20/2022-11:15:49] [I] Inputs:
[06/20/2022-11:15:49] [I] === Reporting Options ===
[06/20/2022-11:15:49] [I] Verbose: Disabled
[06/20/2022-11:15:49] [I] Averages: 10 inferences
[06/20/2022-11:15:49] [I] Percentile: 99
[06/20/2022-11:15:49] [I] Dump refittable layers:Disabled
[06/20/2022-11:15:49] [I] Dump output: Disabled
[06/20/2022-11:15:49] [I] Profile: Disabled
[06/20/2022-11:15:49] [I] Export timing to JSON file:
[06/20/2022-11:15:49] [I] Export output to JSON file:
[06/20/2022-11:15:49] [I] Export profile to JSON file:
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] === Device Information ===
[06/20/2022-11:15:49] [I] Selected Device: NVIDIA GeForce RTX 3090
[06/20/2022-11:15:49] [I] Compute Capability: 8.6
[06/20/2022-11:15:49] [I] SMs: 82
[06/20/2022-11:15:49] [I] Compute Clock Rate: 1.695 GHz
[06/20/2022-11:15:49] [I] Device Global Memory: 24259 MiB
[06/20/2022-11:15:49] [I] Shared Memory per SM: 100 KiB
[06/20/2022-11:15:49] [I] Memory Bus Width: 384 bits (ECC disabled)
[06/20/2022-11:15:49] [I] Memory Clock Rate: 9.751 GHz
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] TensorRT version: 8.4.0
[06/20/2022-11:15:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +357, GPU +0, now: CPU 365, GPU 860 (MiB)
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 384 MiB, GPU 860 MiB
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 759 MiB, GPU 982 MiB
[06/20/2022-11:15:50] [I] Start parsing network model
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:50] [I] [TRT] Input filename: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:50] [I] [TRT] ONNX IR version: 0.0.5
[06/20/2022-11:15:50] [I] [TRT] Opset version: 10
[06/20/2022-11:15:50] [I] [TRT] Producer name: pytorch
[06/20/2022-11:15:50] [I] [TRT] Producer version: 1.11.0
[06/20/2022-11:15:50] [I] [TRT] Domain:
[06/20/2022-11:15:50] [I] [TRT] Model version: 0
[06/20/2022-11:15:50] [I] [TRT] Doc string:
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:52] [I] Finish parsing network model
[06/20/2022-11:15:52] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1830, GPU 1863 (MiB)
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +13, now: CPU 1830, GPU 1876 (MiB)
[06/20/2022-11:15:52] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/20/2022-11:18:55] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[06/20/2022-11:18:56] [I] [TRT] Total Host Persistent Memory: 102496
[06/20/2022-11:18:56] [I] [TRT] Total Device Persistent Memory: 10820608
[06/20/2022-11:18:56] [I] [TRT] Total Scratch Memory: 2397440
[06/20/2022-11:18:56] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 85 MiB, GPU 7319 MiB
[06/20/2022-11:18:56] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 10.6542ms to assign 9 blocks to 125 nodes requiring 654170112 bytes.
[06/20/2022-11:18:56] [I] [TRT] Total Activation Memory: 654170112
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3140, GPU 2826 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 2834 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +85, GPU +96, now: CPU 85, GPU 96 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3217, GPU 2286 (MiB)
[06/20/2022-11:18:56] [I] [TRT] Loaded engine size: 86 MiB
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3225, GPU 2594 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3225, GPU 2604 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +95, now: CPU 0, GPU 95 (MiB)
[06/20/2022-11:18:56] [I] Engine built in 187.03 sec.
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2502, GPU 2280 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2502, GPU 2288 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +634, now: CPU 0, GPU 729 (MiB)
[06/20/2022-11:18:56] [I] Using random values for input input
[06/20/2022-11:18:56] [I] Created input binding for input with dimensions 1x1x96x256x96
[06/20/2022-11:18:56] [I] Using random values for output output6
[06/20/2022-11:18:56] [I] Created output binding for output6 with dimensions 1x3x6x8x6
[06/20/2022-11:18:56] [I] Using random values for output output5
[06/20/2022-11:18:56] [I] Created output binding for output5 with dimensions 1x3x6x16x6
[06/20/2022-11:18:56] [I] Using random values for output output4
[06/20/2022-11:18:56] [I] Created output binding for output4 with dimensions 1x3x12x32x12
[06/20/2022-11:18:56] [I] Using random values for output output3
[06/20/2022-11:18:56] [I] Created output binding for output3 with dimensions 1x3x24x64x24
[06/20/2022-11:18:56] [I] Using random values for output output2
[06/20/2022-11:18:56] [I] Created output binding for output2 with dimensions 1x3x48x128x48
[06/20/2022-11:18:56] [I] Using random values for output output1
[06/20/2022-11:18:56] [I] Created output binding for output1 with dimensions 1x3x96x256x96
[06/20/2022-11:18:56] [I] Starting inference
[06/20/2022-11:18:59] [I] Warmup completed 10 queries over 200 ms
[06/20/2022-11:18:59] [I] Timing trace has 137 queries over 3.07714 s
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Trace details ===
[06/20/2022-11:18:59] [I] Trace averages of 10 runs:
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6876 ms - Host latency: 24.8693 ms (end to end 44.7577 ms, enqueue 1.72647 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.0524 ms - Host latency: 25.3096 ms (end to end 46.1373 ms, enqueue 1.93383 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.14 ms - Host latency: 23.3075 ms (end to end 41.8735 ms, enqueue 1.66331 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.833 ms - Host latency: 24.0656 ms (end to end 43.392 ms, enqueue 1.86876 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.1491 ms - Host latency: 25.4065 ms (end to end 45.819 ms, enqueue 1.83251 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6759 ms - Host latency: 24.9184 ms (end to end 45.1935 ms, enqueue 1.77281 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3631 ms - Host latency: 23.5985 ms (end to end 42.2557 ms, enqueue 1.77977 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.7853 ms - Host latency: 26.0267 ms (end to end 47.2052 ms, enqueue 1.77408 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.4162 ms - Host latency: 24.6669 ms (end to end 44.816 ms, enqueue 1.80068 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3482 ms - Host latency: 23.5172 ms (end to end 42.3275 ms, enqueue 1.89629 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.2345 ms - Host latency: 23.4628 ms (end to end 42.1894 ms, enqueue 1.91763 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.0117 ms - Host latency: 24.2404 ms (end to end 43.5459 ms, enqueue 1.95603 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.8791 ms - Host latency: 25.1237 ms (end to end 45.6101 ms, enqueue 1.76414 ms)
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Performance summary ===
[06/20/2022-11:18:59] [I] Throughput: 44.5219 qps
[06/20/2022-11:18:59] [I] Latency: min = 23.0624 ms, max = 27.8524 ms, mean = 24.5346 ms, median = 23.9123 ms, percentile(99%) = 27.2732 ms
[06/20/2022-11:18:59] [I] End-to-End Host Latency: min = 41.616 ms, max = 48.0995 ms, mean = 44.2912 ms, median = 43.2527 ms, percentile(99%) = 48.0986 ms
[06/20/2022-11:18:59] [I] Enqueue Time: min = 0.677887 ms, max = 2.58347 ms, mean = 1.81925 ms, median = 1.81586 ms, percentile(99%) = 2.40967 ms
[06/20/2022-11:18:59] [I] H2D Latency: min = 0.386871 ms, max = 0.479736 ms, mean = 0.437658 ms, median = 0.438965 ms, percentile(99%) = 0.464966 ms
[06/20/2022-11:18:59] [I] GPU Compute Time: min = 20.8108 ms, max = 25.9625 ms, mean = 22.3062 ms, median = 21.6658 ms, percentile(99%) = 24.9928 ms
[06/20/2022-11:18:59] [I] D2H Latency: min = 1.44308 ms, max = 1.88623 ms, mean = 1.7908 ms, median = 1.81104 ms, percentile(99%) = 1.88232 ms
[06/20/2022-11:18:59] [I] Total Host Walltime: 3.07714 s
[06/20/2022-11:18:59] [I] Total GPU Compute Time: 3.05595 s
[06/20/2022-11:18:59] [W] * GPU compute time is unstable, with coefficient of variance = 5.60913%.
[06/20/2022-11:18:59] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/20/2022-11:18:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/20/2022-11:18:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16

Could you please try ONNX-Runtime and compare the results with your PyTorch model, to make sure there is no issue in the ONNX model?
If there is no issue in the ONNX model, please share the repro script and ONNX model with us so that we can try it on our end for better debugging.
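A minimal comparison sketch, assuming onnxruntime is installed and torch_model is a hypothetical placeholder for your PyTorch network (the 1x1x96x256x96 input shape is taken from the trtexec log above, and the [0] index picks the first of the six output heads):

import numpy as np
import onnxruntime as ort
import torch

torch_model.eval()                            # hypothetical: your PyTorch network
dummy_input = torch.randn(1, 1, 96, 256, 96)  # input shape from the trtexec log

with torch.no_grad():
    torch_out = torch_model(dummy_input)[0].numpy()  # first output head

sess = ort.InferenceSession("nnUNet_model_best.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
ort_out = sess.run(None, {input_name: dummy_input.numpy()})[0]

print("max abs diff:", np.abs(torch_out - ort_out).max())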

Thank you.

Hi,

I will test the ONNX model, but I don't think the ONNX model is the problem, because the FP32 TRT engine generated from the same ONNX file predicts correctly. How should I share my ONNX model with you? Could I have an email address? I don't want to share the model publicly.

Please share with us a minimal repro script and model so we can compare the TRT FP16 output and the ONNX output.
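One convenient way to produce exactly this comparison, assuming the polygraphy command-line tool is installed, is a single command that runs the model under both ONNX-Runtime and TensorRT FP16 and compares the outputs:

polygraphy run nnUNet_model_best.onnx --onnxrt --trt --fp16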

Please DM me.