TensorRT does not speed up the model as expected

Description

I tried TensorRT with my model. With FP16 the model is about 20% faster than PyTorch, but with FP32 it is about 20% slower than PyTorch. It looks like TensorRT does not work well with my hardware.
Is it possible that my Ubuntu version is too new, and that I have to install dependencies such as cuDNN built for Ubuntu 20.04?

Environment

TensorRT Version: 8.4.0.6
GPU Type: 3090
Nvidia Driver Version: 470.74
CUDA Version: 11.4
CUDNN Version: 8.4.0
CUDA Toolkit: 11.3
Operating System + Version: Ubuntu 21.10
Python Version (if applicable): 3.7.4
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.11
Baremetal or Container (if container which image + tag):

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

We request that you share the model, script, profiler, and performance output, if not already shared, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
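For example, to build and time an FP16 engine in one shot (the paths here are placeholders):

trtexec --onnx=model.onnx --saveEngine=model.trt --fp16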

While measuring the model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead; a rough timing sketch follows the links below.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#measure-performance

https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-803/best-practices/index.html#model-accuracy
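As a rough illustration of timing only the network inference in PyTorch, a minimal sketch could look like the following, where model and dummy_input are hypothetical placeholders for your network and a representative input:

import time
import torch

model = model.eval().cuda()          # hypothetical: your network
dummy_input = dummy_input.cuda()     # hypothetical: a representative input

with torch.no_grad():
    for _ in range(10):              # warm-up, excluded from the measurement
        model(dummy_input)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(100):
        model(dummy_input)           # inference only: no pre/post-processing
    torch.cuda.synchronize()         # wait for all queued GPU work to finish
    print(f"mean latency: {(time.perf_counter() - start) / 100 * 1000:.2f} ms")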

Thanks!

Thanks for your reply. That's actually another problem: I have installed TensorRT, but I cannot find the trtexec tool, nor the samples directory.


I'm not totally sure whether I installed it correctly, or whether this tool has been removed from the newest version of TensorRT.

Another problem is that the inference results are also wrong whenever I set certain configuration flags via config.set_flag(trt.BuilderFlag.xxx), e.g. trt.BuilderFlag.FP16 or trt.BuilderFlag.GPU_FALLBACK. That could be an FP16 precision problem, but the results stay wrong even when I manually set the precision of all layers (a sketch of what I mean is shown below).
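A sketch of what such a per-layer override can look like (assuming network and config are created as in the script further down; the OBEY_PRECISION_CONSTRAINTS flag is what makes the builder honor the per-layer settings):

# force every layer to run in FP32 even with the FP16 flag set
for i in range(network.num_layers):
    layer = network.get_layer(i)
    layer.precision = trt.float32
    for j in range(layer.num_outputs):
        layer.set_output_type(j, trt.float32)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)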
Below is the script I used to generate the TRT model. I have also tried generating the engine with the polygraphy API (see the sketch after the script), and the problems remain.
The model and the network should be correct, because they run well on another machine with a 2080 GPU. We have also tried the code on another 3090 laptop, and the prediction is incorrect there as well.

import tensorrt as trt
import polygraphy  # unused here; the polygraphy variant is sketched below

model_path = './models/nnUNet_model_best.onnx'

logger = trt.Logger(trt.Logger.WARNING)  # default logger (replaced below)

class MyLogger(trt.ILogger):
    def __init__(self):
        trt.ILogger.__init__(self)

    def log(self, severity, msg):
        pass  # your custom logging implementation here

logger = MyLogger()

builder = trt.Builder(logger)

print("If the platform has fast int8:", builder.platform_has_fast_int8)

# create a network definition with an explicit batch dimension
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

success = parser.parse_from_file(model_path)
for idx in range(parser.num_errors):
    print(parser.get_error(idx))

if not success:
    pass  # error handling code here

config = builder.create_builder_config()
config.reset()
config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)

config.set_flag(trt.BuilderFlag.FP16)

serialized_engine = builder.build_serialized_network(network, config)

with open("./models/nnUNet_model_best_fp16.trt", "wb") as f:
    f.write(serialized_engine)
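The polygraphy-API variant mentioned above was roughly equivalent to the following sketch (same paths as the script; polygraphy's loaders are lazy, so calling the outermost one triggers the build):

from polygraphy.backend.trt import (
    CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, SaveEngine,
)

# build an FP16 engine from the same ONNX file and save it
build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("./models/nnUNet_model_best.onnx"),
    config=CreateConfig(fp16=True),
)
SaveEngine(build_engine, path="./models/nnUNet_model_best_fp16.trt")()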

trtexec should be located in /usr/src/tensorrt/bin or /opt/tensorrt/bin.
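If the binary is not on your PATH, pointing at it directly or extending PATH should work (assuming the standard package layout):

export PATH=$PATH:/usr/src/tensorrt/bin
trtexec --help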

Also, please share the repro ONNX model and script with us so that we can try it on our end for better debugging.

Thank you.

Thanks for your reply. I have found the trtexec tool and used it to build an engine, but the prediction is still wrong when using FP16. The repro command and its log follow.
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16
[06/20/2022-11:15:49] [W] --explicitBatch flag has been deprecated and has no effect!
[06/20/2022-11:15:49] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[06/20/2022-11:15:49] [I] === Model Options ===
[06/20/2022-11:15:49] [I] Format: ONNX
[06/20/2022-11:15:49] [I] Model: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:49] [I] Output:
[06/20/2022-11:15:49] [I] === Build Options ===
[06/20/2022-11:15:49] [I] Max batch: explicit batch
[06/20/2022-11:15:49] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[06/20/2022-11:15:49] [I] minTiming: 1
[06/20/2022-11:15:49] [I] avgTiming: 8
[06/20/2022-11:15:49] [I] Precision: FP32+FP16
[06/20/2022-11:15:49] [I] LayerPrecisions:
[06/20/2022-11:15:49] [I] Calibration:
[06/20/2022-11:15:49] [I] Refit: Disabled
[06/20/2022-11:15:49] [I] Sparsity: Disabled
[06/20/2022-11:15:49] [I] Safe mode: Disabled
[06/20/2022-11:15:49] [I] DirectIO mode: Disabled
[06/20/2022-11:15:49] [I] Restricted mode: Disabled
[06/20/2022-11:15:49] [I] Save engine: /home/hurwa/trtexec_test/nnUNet_model_best.trt
[06/20/2022-11:15:49] [I] Load engine:
[06/20/2022-11:15:49] [I] Profiling verbosity: 0
[06/20/2022-11:15:49] [I] Tactic sources: Using default tactic sources
[06/20/2022-11:15:49] [I] timingCacheMode: local
[06/20/2022-11:15:49] [I] timingCacheFile:
[06/20/2022-11:15:49] [I] Input(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Output(s)s format: fp32:CHW
[06/20/2022-11:15:49] [I] Input build shapes: model
[06/20/2022-11:15:49] [I] Input calibration shapes: model
[06/20/2022-11:15:49] [I] === System Options ===
[06/20/2022-11:15:49] [I] Device: 0
[06/20/2022-11:15:49] [I] DLACore:
[06/20/2022-11:15:49] [I] Plugins:
[06/20/2022-11:15:49] [I] === Inference Options ===
[06/20/2022-11:15:49] [I] Batch: Explicit
[06/20/2022-11:15:49] [I] Input inference shapes: model
[06/20/2022-11:15:49] [I] Iterations: 10
[06/20/2022-11:15:49] [I] Duration: 3s (+ 200ms warm up)
[06/20/2022-11:15:49] [I] Sleep time: 0ms
[06/20/2022-11:15:49] [I] Idle time: 0ms
[06/20/2022-11:15:49] [I] Streams: 1
[06/20/2022-11:15:49] [I] ExposeDMA: Disabled
[06/20/2022-11:15:49] [I] Data transfers: Enabled
[06/20/2022-11:15:49] [I] Spin-wait: Disabled
[06/20/2022-11:15:49] [I] Multithreading: Disabled
[06/20/2022-11:15:49] [I] CUDA Graph: Disabled
[06/20/2022-11:15:49] [I] Separate profiling: Disabled
[06/20/2022-11:15:49] [I] Time Deserialize: Disabled
[06/20/2022-11:15:49] [I] Time Refit: Disabled
[06/20/2022-11:15:49] [I] Skip inference: Disabled
[06/20/2022-11:15:49] [I] Inputs:
[06/20/2022-11:15:49] [I] === Reporting Options ===
[06/20/2022-11:15:49] [I] Verbose: Disabled
[06/20/2022-11:15:49] [I] Averages: 10 inferences
[06/20/2022-11:15:49] [I] Percentile: 99
[06/20/2022-11:15:49] [I] Dump refittable layers:Disabled
[06/20/2022-11:15:49] [I] Dump output: Disabled
[06/20/2022-11:15:49] [I] Profile: Disabled
[06/20/2022-11:15:49] [I] Export timing to JSON file:
[06/20/2022-11:15:49] [I] Export output to JSON file:
[06/20/2022-11:15:49] [I] Export profile to JSON file:
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] === Device Information ===
[06/20/2022-11:15:49] [I] Selected Device: NVIDIA GeForce RTX 3090
[06/20/2022-11:15:49] [I] Compute Capability: 8.6
[06/20/2022-11:15:49] [I] SMs: 82
[06/20/2022-11:15:49] [I] Compute Clock Rate: 1.695 GHz
[06/20/2022-11:15:49] [I] Device Global Memory: 24259 MiB
[06/20/2022-11:15:49] [I] Shared Memory per SM: 100 KiB
[06/20/2022-11:15:49] [I] Memory Bus Width: 384 bits (ECC disabled)
[06/20/2022-11:15:49] [I] Memory Clock Rate: 9.751 GHz
[06/20/2022-11:15:49] [I]
[06/20/2022-11:15:49] [I] TensorRT version: 8.4.0
[06/20/2022-11:15:50] [I] [TRT] [MemUsageChange] Init CUDA: CPU +357, GPU +0, now: CPU 365, GPU 860 (MiB)
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 384 MiB, GPU 860 MiB
[06/20/2022-11:15:50] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 759 MiB, GPU 982 MiB
[06/20/2022-11:15:50] [I] Start parsing network model
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:50] [I] [TRT] Input filename: /home/trtexec_test/nnUNet_model_best.onnx
[06/20/2022-11:15:50] [I] [TRT] ONNX IR version: 0.0.5
[06/20/2022-11:15:50] [I] [TRT] Opset version: 10
[06/20/2022-11:15:50] [I] [TRT] Producer name: pytorch
[06/20/2022-11:15:50] [I] [TRT] Producer version: 1.11.0
[06/20/2022-11:15:50] [I] [TRT] Domain:
[06/20/2022-11:15:50] [I] [TRT] Model version: 0
[06/20/2022-11:15:50] [I] [TRT] Doc string:
[06/20/2022-11:15:50] [I] [TRT] ----------------------------------------------------------------
[06/20/2022-11:15:52] [I] Finish parsing network model
[06/20/2022-11:15:52] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1830, GPU 1863 (MiB)
[06/20/2022-11:15:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +13, now: CPU 1830, GPU 1876 (MiB)
[06/20/2022-11:15:52] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/20/2022-11:18:55] [I] [TRT] Detected 1 inputs and 6 output network tensors.
[06/20/2022-11:18:56] [I] [TRT] Total Host Persistent Memory: 102496
[06/20/2022-11:18:56] [I] [TRT] Total Device Persistent Memory: 10820608
[06/20/2022-11:18:56] [I] [TRT] Total Scratch Memory: 2397440
[06/20/2022-11:18:56] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 85 MiB, GPU 7319 MiB
[06/20/2022-11:18:56] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 10.6542ms to assign 9 blocks to 125 nodes requiring 654170112 bytes.
[06/20/2022-11:18:56] [I] [TRT] Total Activation Memory: 654170112
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3140, GPU 2826 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 2834 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +85, GPU +96, now: CPU 85, GPU 96 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 3217, GPU 2286 (MiB)
[06/20/2022-11:18:56] [I] [TRT] Loaded engine size: 86 MiB
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3225, GPU 2594 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3225, GPU 2604 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +95, now: CPU 0, GPU 95 (MiB)
[06/20/2022-11:18:56] [I] Engine built in 187.03 sec.
[06/20/2022-11:18:56] [W] [TRT] TensorRT was linked against cuBLAS/cuBLAS LT 11.8.0 but loaded cuBLAS/cuBLAS LT 11.5.1
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2502, GPU 2280 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 2502, GPU 2288 (MiB)
[06/20/2022-11:18:56] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +634, now: CPU 0, GPU 729 (MiB)
[06/20/2022-11:18:56] [I] Using random values for input input
[06/20/2022-11:18:56] [I] Created input binding for input with dimensions 1x1x96x256x96
[06/20/2022-11:18:56] [I] Using random values for output output6
[06/20/2022-11:18:56] [I] Created output binding for output6 with dimensions 1x3x6x8x6
[06/20/2022-11:18:56] [I] Using random values for output output5
[06/20/2022-11:18:56] [I] Created output binding for output5 with dimensions 1x3x6x16x6
[06/20/2022-11:18:56] [I] Using random values for output output4
[06/20/2022-11:18:56] [I] Created output binding for output4 with dimensions 1x3x12x32x12
[06/20/2022-11:18:56] [I] Using random values for output output3
[06/20/2022-11:18:56] [I] Created output binding for output3 with dimensions 1x3x24x64x24
[06/20/2022-11:18:56] [I] Using random values for output output2
[06/20/2022-11:18:56] [I] Created output binding for output2 with dimensions 1x3x48x128x48
[06/20/2022-11:18:56] [I] Using random values for output output1
[06/20/2022-11:18:56] [I] Created output binding for output1 with dimensions 1x3x96x256x96
[06/20/2022-11:18:56] [I] Starting inference
[06/20/2022-11:18:59] [I] Warmup completed 10 queries over 200 ms
[06/20/2022-11:18:59] [I] Timing trace has 137 queries over 3.07714 s
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Trace details ===
[06/20/2022-11:18:59] [I] Trace averages of 10 runs:
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6876 ms - Host latency: 24.8693 ms (end to end 44.7577 ms, enqueue 1.72647 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.0524 ms - Host latency: 25.3096 ms (end to end 46.1373 ms, enqueue 1.93383 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.14 ms - Host latency: 23.3075 ms (end to end 41.8735 ms, enqueue 1.66331 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.833 ms - Host latency: 24.0656 ms (end to end 43.392 ms, enqueue 1.86876 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.1491 ms - Host latency: 25.4065 ms (end to end 45.819 ms, enqueue 1.83251 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.6759 ms - Host latency: 24.9184 ms (end to end 45.1935 ms, enqueue 1.77281 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3631 ms - Host latency: 23.5985 ms (end to end 42.2557 ms, enqueue 1.77977 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 23.7853 ms - Host latency: 26.0267 ms (end to end 47.2052 ms, enqueue 1.77408 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.4162 ms - Host latency: 24.6669 ms (end to end 44.816 ms, enqueue 1.80068 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.3482 ms - Host latency: 23.5172 ms (end to end 42.3275 ms, enqueue 1.89629 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 21.2345 ms - Host latency: 23.4628 ms (end to end 42.1894 ms, enqueue 1.91763 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.0117 ms - Host latency: 24.2404 ms (end to end 43.5459 ms, enqueue 1.95603 ms)
[06/20/2022-11:18:59] [I] Average on 10 runs - GPU latency: 22.8791 ms - Host latency: 25.1237 ms (end to end 45.6101 ms, enqueue 1.76414 ms)
[06/20/2022-11:18:59] [I]
[06/20/2022-11:18:59] [I] === Performance summary ===
[06/20/2022-11:18:59] [I] Throughput: 44.5219 qps
[06/20/2022-11:18:59] [I] Latency: min = 23.0624 ms, max = 27.8524 ms, mean = 24.5346 ms, median = 23.9123 ms, percentile(99%) = 27.2732 ms
[06/20/2022-11:18:59] [I] End-to-End Host Latency: min = 41.616 ms, max = 48.0995 ms, mean = 44.2912 ms, median = 43.2527 ms, percentile(99%) = 48.0986 ms
[06/20/2022-11:18:59] [I] Enqueue Time: min = 0.677887 ms, max = 2.58347 ms, mean = 1.81925 ms, median = 1.81586 ms, percentile(99%) = 2.40967 ms
[06/20/2022-11:18:59] [I] H2D Latency: min = 0.386871 ms, max = 0.479736 ms, mean = 0.437658 ms, median = 0.438965 ms, percentile(99%) = 0.464966 ms
[06/20/2022-11:18:59] [I] GPU Compute Time: min = 20.8108 ms, max = 25.9625 ms, mean = 22.3062 ms, median = 21.6658 ms, percentile(99%) = 24.9928 ms
[06/20/2022-11:18:59] [I] D2H Latency: min = 1.44308 ms, max = 1.88623 ms, mean = 1.7908 ms, median = 1.81104 ms, percentile(99%) = 1.88232 ms
[06/20/2022-11:18:59] [I] Total Host Walltime: 3.07714 s
[06/20/2022-11:18:59] [I] Total GPU Compute Time: 3.05595 s
[06/20/2022-11:18:59] [W] * GPU compute time is unstable, with coefficient of variance = 5.60913%.
[06/20/2022-11:18:59] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[06/20/2022-11:18:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[06/20/2022-11:18:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # ./trtexec --onnx=/home/trtexec_test/nnUNet_model_best.onnx --explicitBatch --saveEngine=/home/trtexec_test/nnUNet_model_best.trt --fp16

Could you please try ONNX-Runtime and compare the results with your PyTorch model, to make sure there is no issue in the ONNX model?
If there is no issue in the ONNX model, please share the repro script and ONNX model with us so that we can try it on our end for better debugging.
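A minimal comparison sketch, assuming onnxruntime is installed and torch_model is a hypothetical placeholder for your PyTorch network (the 1x1x96x256x96 input shape is taken from the trtexec log above, and the [0] index picks the first of the six output heads):

import numpy as np
import onnxruntime as ort
import torch

torch_model.eval()                            # hypothetical: your PyTorch network
dummy_input = torch.randn(1, 1, 96, 256, 96)  # input shape from the trtexec log

with torch.no_grad():
    torch_out = torch_model(dummy_input)[0].numpy()  # first output head

sess = ort.InferenceSession("nnUNet_model_best.onnx",
                            providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
ort_out = sess.run(None, {input_name: dummy_input.numpy()})[0]

print("max abs diff:", np.abs(torch_out - ort_out).max())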

Thank you.

Hi,

I will test the ONNX model, but I don't think the ONNX model is the problem, because the FP32 TRT engine generated from the same ONNX file predicts correctly. How should I share my ONNX model with you? Could I have an email address? I don't want to share the model publicly.

Please share with us a minimal repro script and model so we can compare the TRT FP16 output and the ONNX output.
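One convenient way to produce exactly this comparison, assuming the polygraphy command-line tool is installed, is a single command that runs the model under both ONNX-Runtime and TensorRT FP16 and compares the outputs:

polygraphy run nnUNet_model_best.onnx --onnxrt --trt --fp16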

Please DM me.