Description
I’d like to make a TensorRT engine file work across different compute capabilities. I’ve found that a CUDA application can be built to be backward compatible across different compute capabilities; see this link.
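For reference, the CUDA-side mechanism works by embedding machine code (SASS) for several SM versions, plus PTX for forward compatibility, into one fat binary. A sketch of such an nvcc invocation, assuming a hypothetical source file my_kernel.cu:

```shell
# Compile one binary carrying SASS for several compute capabilities,
# plus PTX for compute_75 so newer GPUs can JIT-compile it at load time.
nvcc my_kernel.cu -o my_app \
    -gencode arch=compute_53,code=sm_53 \
    -gencode arch=compute_61,code=sm_61 \
    -gencode arch=compute_75,code=sm_75 \
    -gencode arch=compute_75,code=compute_75
```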
With this in mind, I thought it might be possible to do the same for a TensorRT engine file by building the trtexec tool with multi-architecture support. However, after following the build instructions from the TensorRT GitHub repo, I found that the build requires a prebuilt package from the NVIDIA Developer Zone containing libnvinfer, which is not generated from the source code on GitHub. Because of this, it seems that making the engine file compatible across compute capabilities is not possible. If it is possible, is there another approach?
This is how I verified it:
- Download the model from here.
- Clone the TensorRT GitHub repo and download the prebuilt package required for the build.
- Build with multiple architectures enabled (GPU_ARCHS is not defined. Generating CUDA code for default SMs: 53;60;61;70;75).
- Use the trtexec tool to convert an ONNX model on an NVIDIA GeForce GTX 1650 Ti (compute capability 7.5):
trtexec --onnx=mobilenetv2-7.onnx --workspace=64 --fp16 --explicitBatch --saveEngine=mobilenetv2.engine
- Execute the built engine file on a different machine with an NVIDIA GeForce GTX 1050 Ti (compute capability 6.1):
~/TensorRT-8.2.1.8/bin/trtexec --loadEngine=mobilenetv2.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8201] # /home/mle/TensorRT-8.2.1.8/bin/trtexec --loadEngine=mobilenetv2.engine
[05/19/2022-09:40:27] [I] === Model Options ===
[05/19/2022-09:40:27] [I] Format: *
[05/19/2022-09:40:27] [I] Model:
[05/19/2022-09:40:27] [I] Output:
[05/19/2022-09:40:27] [I] === Build Options ===
[05/19/2022-09:40:27] [I] Max batch: 1
[05/19/2022-09:40:27] [I] Workspace: 16 MiB
[05/19/2022-09:40:27] [I] minTiming: 1
[05/19/2022-09:40:27] [I] avgTiming: 8
[05/19/2022-09:40:27] [I] Precision: FP32
[05/19/2022-09:40:27] [I] Calibration:
[05/19/2022-09:40:27] [I] Refit: Disabled
[05/19/2022-09:40:27] [I] Sparsity: Disabled
[05/19/2022-09:40:27] [I] Safe mode: Disabled
[05/19/2022-09:40:27] [I] DirectIO mode: Disabled
[05/19/2022-09:40:27] [I] Restricted mode: Disabled
[05/19/2022-09:40:27] [I] Save engine:
[05/19/2022-09:40:27] [I] Load engine: mobilenetv2.engine
[05/19/2022-09:40:27] [I] Profiling verbosity: 0
[05/19/2022-09:40:27] [I] Tactic sources: Using default tactic sources
[05/19/2022-09:40:27] [I] timingCacheMode: local
[05/19/2022-09:40:27] [I] timingCacheFile:
[05/19/2022-09:40:27] [I] Input(s)s format: fp32:CHW
[05/19/2022-09:40:27] [I] Output(s)s format: fp32:CHW
[05/19/2022-09:40:27] [I] Input build shapes: model
[05/19/2022-09:40:27] [I] Input calibration shapes: model
[05/19/2022-09:40:27] [I] === System Options ===
[05/19/2022-09:40:27] [I] Device: 0
[05/19/2022-09:40:27] [I] DLACore:
[05/19/2022-09:40:27] [I] Plugins:
[05/19/2022-09:40:27] [I] === Inference Options ===
[05/19/2022-09:40:27] [I] Batch: 1
[05/19/2022-09:40:27] [I] Input inference shapes: model
[05/19/2022-09:40:27] [I] Iterations: 10
[05/19/2022-09:40:27] [I] Duration: 3s (+ 200ms warm up)
[05/19/2022-09:40:27] [I] Sleep time: 0ms
[05/19/2022-09:40:27] [I] Idle time: 0ms
[05/19/2022-09:40:27] [I] Streams: 1
[05/19/2022-09:40:27] [I] ExposeDMA: Disabled
[05/19/2022-09:40:27] [I] Data transfers: Enabled
[05/19/2022-09:40:27] [I] Spin-wait: Disabled
[05/19/2022-09:40:27] [I] Multithreading: Disabled
[05/19/2022-09:40:27] [I] CUDA Graph: Disabled
[05/19/2022-09:40:27] [I] Separate profiling: Disabled
[05/19/2022-09:40:27] [I] Time Deserialize: Disabled
[05/19/2022-09:40:27] [I] Time Refit: Disabled
[05/19/2022-09:40:27] [I] Skip inference: Disabled
[05/19/2022-09:40:27] [I] Inputs:
[05/19/2022-09:40:27] [I] === Reporting Options ===
[05/19/2022-09:40:27] [I] Verbose: Disabled
[05/19/2022-09:40:27] [I] Averages: 10 inferences
[05/19/2022-09:40:27] [I] Percentile: 99
[05/19/2022-09:40:27] [I] Dump refittable layers:Disabled
[05/19/2022-09:40:27] [I] Dump output: Disabled
[05/19/2022-09:40:27] [I] Profile: Disabled
[05/19/2022-09:40:27] [I] Export timing to JSON file:
[05/19/2022-09:40:27] [I] Export output to JSON file:
[05/19/2022-09:40:27] [I] Export profile to JSON file:
[05/19/2022-09:40:27] [I]
[05/19/2022-09:40:27] [I] === Device Information ===
[05/19/2022-09:40:27] [I] Selected Device: NVIDIA GeForce GTX 1050 Ti
[05/19/2022-09:40:27] [I] Compute Capability: 6.1
[05/19/2022-09:40:27] [I] SMs: 6
[05/19/2022-09:40:27] [I] Compute Clock Rate: 1.4175 GHz
[05/19/2022-09:40:27] [I] Device Global Memory: 4040 MiB
[05/19/2022-09:40:27] [I] Shared Memory per SM: 96 KiB
[05/19/2022-09:40:27] [I] Memory Bus Width: 128 bits (ECC disabled)
[05/19/2022-09:40:27] [I] Memory Clock Rate: 3.504 GHz
[05/19/2022-09:40:27] [I]
[05/19/2022-09:40:27] [I] TensorRT version: 8.2.1
[05/19/2022-09:40:27] [I] [TRT] [MemUsageChange] Init CUDA: CPU +158, GPU +0, now: CPU 169, GPU 116 (MiB)
[05/19/2022-09:40:27] [I] [TRT] Loaded engine size: 7 MiB
[05/19/2022-09:40:27] [E] Error[6]: The engine plan file is generated on an incompatible device, expecting compute 6.1 got compute 7.5, please rebuild.
[05/19/2022-09:40:27] [E] Error[4]: [runtime.cpp::deserializeCudaEngine::50] Error Code 4: Internal Error (Engine deserialization failed.)
[05/19/2022-09:40:27] [E] Failed to create engine from model.
[05/19/2022-09:40:27] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8201] # /home/mle/TensorRT-8.2.1.8/bin/trtexec --loadEngine=mobilenetv2.engine
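The Error[6] message above ("generated on an incompatible device ... please rebuild") indicates that a serialized plan is tied to the compute capability it was built on. The usual workaround is to build one engine per target GPU (running trtexec on each device) and pick the matching file at load time. A minimal sketch of that selection step; the mobilenetv2_smXY.engine naming scheme is my own assumption, not a TensorRT convention:

```shell
# Given a compute capability string such as "6.1", print the name of the
# engine file built for that architecture (assumed naming: mobilenetv2_smXY.engine).
engine_for_cc() {
    cc=$1    # e.g. "6.1", as reported in trtexec's "Compute Capability" line
    echo "mobilenetv2_sm$(echo "$cc" | tr -d '.').engine"
}

# e.g. on the GTX 1050 Ti above:
#   ~/TensorRT-8.2.1.8/bin/trtexec --loadEngine="$(engine_for_cc 6.1)"
```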
Environment
TensorRT Version: 8.2.1.8
GPU Type: NVIDIA GeForce GTX 1650 Ti with Max-Q Design
Nvidia Driver Version: 495.29.05
CUDA Version: 10.2
CUDNN Version: 8.2
Operating System + Version: Ubuntu 20.10
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):