Struggling to get model onto TensorRT


Having trouble getting a PyTorch model onto TensorRT


TensorRT Version: 8.6.3
GPU Type: 3090
Nvidia Driver Version: latest
CUDA Version: 12.1
Operating System + Version: Windows 10
Python Version (if applicable): 3.11
PyTorch Version (if applicable): 3.11
Baremetal or Container (if container which image + tag): 24.02-py3

Relevant Files

onnx model

The model I’m trying to get onto tensorrt is:

Steps To Reproduce

I’ve spent the past few days trying to get my PyTorch model onto TensorRT and have run into a lot of issues. I’m hoping someone can help me out.

What I’ve tried:

  1. Getting to tensorrt via torch-tensorrt.
    I can compile my model using torch_tensorrt.compile and save it. However, this approach has 2 major issues for me right now that make it unusable:

a. Compiling the model requires a very large amount of VRAM, so the largest batch size I can compile at is half the batch size I can use during normal inference. This severely hurts throughput on the compiled and saved model.

b. I can’t get dynamic batch sizes to work at all. See the ticket I opened regarding this issue:

Despite this, I was able to test and see that this model gets around 74 images/sec.

EDIT: I was able to check the accuracy of these results. Although the model runs, the accuracy is basically 0, so something must have gone wrong with this setup as well.

  2. Getting to TensorRT via ONNX as an intermediate step
    a. When I export from PyTorch to ONNX using the currently recommended dynamo exporter, I get an ONNX file, but converting that file to TensorRT fails with this error:
    builtin_op_importers.cpp:5404 In function importFallbackPluginImporter · Issue #3734 · NVIDIA/TensorRT · GitHub

b. I discovered I could get around this specific error by exporting from PyTorch with the older torch.onnx.export method instead. The resulting ONNX file (attached above) does pass trtexec. However, if I understand the output correctly (assuming latency is per batch), it appears to be much slower than the roughly 74 images/sec from approach 1 (batch size 32):

[03/25/2024-21:11:42] [I] Input binding for modelInput with dimensions 32x3x384x384 is created.
[03/25/2024-21:11:42] [I] Output binding for modelOutput with dimensions 32x36 is created.
[03/25/2024-21:11:42] [I] Starting inference
[03/25/2024-21:12:01] [I] Warmup completed 1 queries over 200 ms
[03/25/2024-21:12:01] [I] Timing trace has 10 queries over 18.6806 s
[03/25/2024-21:12:01] [I]
[03/25/2024-21:12:01] [I] === Trace details ===
[03/25/2024-21:12:01] [I] Trace averages of 10 runs:
[03/25/2024-21:12:01] [I] Average on 10 runs - GPU latency: 1699.23 ms - Host latency: 1703.69 ms (enqueue 3.74132 ms)
[03/25/2024-21:12:01] [I]
[03/25/2024-21:12:01] [I] === Performance summary ===
[03/25/2024-21:12:01] [I] Throughput: 0.535313 qps
[03/25/2024-21:12:01] [I] Latency: min = 1701.02 ms, max = 1708.43 ms, mean = 1703.69 ms, median = 1703.18 ms, percentile(90%) = 1707.24 ms, percentile(95%) = 1708.43 ms, percentile(99%) = 1708.43 ms
[03/25/2024-21:12:01] [I] Enqueue Time: min = 3.43262 ms, max = 4.30225 ms, mean = 3.74132 ms, median = 3.72412 ms, percentile(90%) = 3.83301 ms, percentile(95%) = 4.30225 ms, percentile(99%) = 4.30225 ms
[03/25/2024-21:12:01] [I] H2D Latency: min = 4.43896 ms, max = 4.59091 ms, mean = 4.4607 ms, median = 4.448 ms, percentile(90%) = 4.45312 ms, percentile(95%) = 4.59091 ms, percentile(99%) = 4.59091 ms
[03/25/2024-21:12:01] [I] GPU Compute Time: min = 1696.54 ms, max = 1703.99 ms, mean = 1699.23 ms, median = 1698.73 ms, percentile(90%) = 1702.79 ms, percentile(95%) = 1703.99 ms, percentile(99%) = 1703.99 ms
[03/25/2024-21:12:01] [I] D2H Latency: min = 0.00195312 ms, max = 0.00585938 ms, mean = 0.00385742 ms, median = 0.00341797 ms, percentile(90%) = 0.00537109 ms, percentile(95%) = 0.00585938 ms, percentile(99%) = 0.00585938 ms
[03/25/2024-21:12:01] [I] Total Host Walltime: 18.6806 s
[03/25/2024-21:12:01] [I] Total GPU Compute Time: 16.9923 s
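
To make that comparison concrete, here is a minimal sketch of the arithmetic. It assumes that one trtexec query is one full batch of 32 images (which is what the 32x3x384x384 input binding suggests, but is still an assumption), and it takes the 74 images/sec figure from the torch-tensorrt test above. The function and constant names are my own, only for illustration.

```python
# Assumption: one trtexec "query" corresponds to one full batch of 32 images.
BATCH_SIZE = 32
TRTEXEC_QPS = 0.535313            # "Throughput" line in the trtexec log
MEAN_GPU_COMPUTE_MS = 1699.23     # mean "GPU Compute Time" in the trtexec log
TORCH_TRT_IMAGES_PER_SEC = 74.0   # figure measured with torch-tensorrt


def images_per_sec_from_qps(qps: float, batch_size: int) -> float:
    """Convert queries (batches) per second into images per second."""
    return qps * batch_size


def images_per_sec_from_latency(latency_ms: float, batch_size: int) -> float:
    """Convert per-batch latency in milliseconds into images per second."""
    return batch_size / (latency_ms / 1000.0)


# Throughput based on host wall time (includes transfers and overhead).
wall_rate = images_per_sec_from_qps(TRTEXEC_QPS, BATCH_SIZE)

# Throughput based on GPU compute time only.
gpu_rate = images_per_sec_from_latency(MEAN_GPU_COMPUTE_MS, BATCH_SIZE)

# How many times slower the trtexec engine is than the torch-tensorrt run.
slowdown = TORCH_TRT_IMAGES_PER_SEC / wall_rate

print(f"trtexec, wall-time based: {wall_rate:.1f} images/sec")
print(f"trtexec, GPU-time based:  {gpu_rate:.1f} images/sec")
print(f"slowdown vs torch-tensorrt: {slowdown:.1f}x")
```

Under that assumption the engine manages roughly 17 to 19 images/sec, a bit over 4 times slower than the torch-tensorrt run. If a query were a single image rather than a batch, the gap would be far larger.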

c. I tried to load and run the TensorRT engine produced in b, following the TensorRT guides, but the following code yields an error:

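import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates and activates a CUDA context on import

# Read the serialized engine that trtexec wrote to disk
with open("Test2.trt", "rb") as f:
    engine_data = f.read()

runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))

# Deserialize the engine and create an execution context for inference
engine = runtime.deserialize_cuda_engine(engine_data)
context = engine.create_execution_context()
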
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x562809a8d2f0'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x562809ea0f10'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280a3287d0'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280a7c11b0'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280ac24840'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280b07f7c0'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280b4ded00'.)
[03/25/2024-19:05:01] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin (Error 709 destroying stream '0x56280b9420c0'.)
Segmentation fault

At this point I feel like I’ve tried a lot of potential routes to get this model onto TensorRT, but I keep hitting roadblocks on every path. Any help would be greatly appreciated!

Hi @nvidiangc75 ,
We are trying this on our end and will update you soon.

Very much appreciated! If you need anything additional from me (such as the ONNX file exported from PyTorch using dynamo_export instead of the older method, which gave a different error), just let me know!