Description
Hi Community,
I am trying to deploy an ONNX model with dynamic shapes. Concretely, the model’s first dimension (the explicit batch size) is dynamic — for simplicity, I will continue to refer to it as “batch size.” The maximum batch size is 8.
To get the best performance, I created 4 optimization profiles and used them to build a TensorRT engine/plan.
For example, the configuration would be:
Optimization Profile 0: range [1, 2]
Input 1: min dim [1, d1, d2, d3], opt dim [2, d1, d2, d3], max dim [2, d1, d2, d3]
Input 2: min dim [1, d4, d5, d6], opt dim [2, d4, d5, d6], max dim [2, d4, d5, d6]
...
Optimization Profile 1: range [2, 4]
Input 1: min dim [2, d1, d2, d3], opt dim [4, d1, d2, d3], max dim [4, d1, d2, d3]
Input 2: min dim [2, d4, d5, d6], opt dim [4, d4, d5, d6], max dim [4, d4, d5, d6]
...
Optimization Profile 2: range [4, 6]
Input 1: min dim [4, d1, d2, d3], opt dim [6, d1, d2, d3], max dim [6, d1, d2, d3]
Input 2: min dim [4, d4, d5, d6], opt dim [6, d4, d5, d6], max dim [6, d4, d5, d6]
...
Optimization Profile 3: range [6, 8]
Input 1: min dim [6, d1, d2, d3], opt dim [8, d1, d2, d3], max dim [8, d1, d2, d3]
Input 2: min dim [6, d4, d5, d6], opt dim [8, d4, d5, d6], max dim [8, d4, d5, d6]
...
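For concreteness, profile 0 above would map to API calls like these (a minimal sketch; the tensor names input1/input2 and the builder/config handles are illustrative, and d1..d3 stand for the fixed non-batch dimensions):

nvinfer1::IOptimizationProfile* profile0 = builder->createOptimizationProfile();
// Batch dim ranges over [1, 2]; the remaining dims are fixed.
profile0->setDimensions("input1", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4{1, d1, d2, d3});
profile0->setDimensions("input1", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4{2, d1, d2, d3});
profile0->setDimensions("input1", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4{2, d1, d2, d3});
// ... same pattern for input2 with {d4, d5, d6}, and analogously for profiles 1-3 ...
config->addOptimizationProfile(profile0);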
The engine builds successfully, but creating the corresponding execution context from the deserialized engine fails.
Note: this issue does not occur when there is only a single optimization profile — in that case everything works perfectly.
Pseudocode
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {}
} logger;  // nvinfer1::ILogger is abstract, so a concrete subclass is needed
std::string serialized_engine;  // assume it has already been built and serialized successfully
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* cuda_engine = runtime->deserializeCudaEngine(serialized_engine.data(), serialized_engine.size());
// Creation fails here:
nvinfer1::IExecutionContext* execution_context = cuda_engine->createExecutionContext();
Error msg:
[executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )
[executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )
I have searched the TensorRT GitHub repository and reviewed the documentation, but found no clues as to why this issue occurs.
Could anyone help me resolve this problem and explain the possible cause?
Any suggestions on working with multiple optimization profiles would also be appreciated.
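For completeness, once the execution context can be created, the intended per-request flow would look roughly like this (a sketch against the TensorRT 8.6 API; the batch-to-profile mapping, the stream, and the tensor name are illustrative):

int batch = 5;  // current request's batch size
// Pick the profile whose range covers the batch (ranges as configured above).
int profile_index = (batch <= 2) ? 0 : (batch <= 4) ? 1 : (batch <= 6) ? 2 : 3;
execution_context->setOptimizationProfileAsync(profile_index, stream);  // stream: a cudaStream_t
execution_context->setInputShape("input1", nvinfer1::Dims4{batch, d1, d2, d3});
// ... set the remaining input shapes and tensor addresses (setTensorAddress), then:
execution_context->enqueueV3(stream);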
Environment
TensorRT Version: 8.6.1.6
GPU Type: Local - RTX 3060 6G Laptop; Server - A40
Nvidia Driver Version: R535
CUDA Version: 12.1
CUDNN Version: 8.9.3
Operating System + Version: Ubuntu 20.04, kernel 5.15.0-67-generic
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Baremetal
Relevant Files
MyelinFusionCreateExecContextCorrup.zip (3.3 MB)
including:
MyelinFusionCreateExeContextCorrup.onnx
MyelinFusionCreateExeContextCorrup.trt
layer_information.json.svg (layer information generated by TREx)
Steps To Reproduce
First, I reviewed trtexec (TensorRT 8.6 tools, GitHub link) and found that this branch does not support building engines with multiple optimization profiles. Consequently, it is difficult to reproduce the full workflow, from engine building through execution-context creation, using the standard tool.
To reproduce the workflow, I used an internal (in-house) tool that follows the same steps as the official TensorRT samples for creating multiple optimization profiles and building engines. At a high level, it does the following (a sketch of this flow follows the list):
- Parse the network/model and create a Builder and NetworkDefinition.
- Create multiple IOptimizationProfile objects and configure the input dimension ranges for each profile.
- Add the profiles to the BuilderConfig and build/serialize the engine.
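Here is a minimal sketch of that build flow against the public TensorRT 8.6 C++ API (the ONNX file name is the one attached above; everything else mirrors the official samples rather than the in-house tool):

#include <NvInfer.h>
#include <NvOnnxParser.h>

// Assumes `logger` is a concrete nvinfer1::ILogger, as in the pseudocode above.
auto* builder = nvinfer1::createInferBuilder(logger);
auto* network = builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
auto* parser = nvonnxparser::createParser(*network, logger);
parser->parseFromFile("MyelinFusionCreateExeContextCorrup.onnx", 0);

auto* config = builder->createBuilderConfig();
const int ranges[4][2] = {{1, 2}, {2, 4}, {4, 6}, {6, 8}};  // [min, opt == max] batch per profile
for (const auto& r : ranges) {
    nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
    for (int i = 0; i < network->getNbInputs(); ++i) {
        auto* input = network->getInput(i);
        nvinfer1::Dims dims = input->getDimensions();  // batch dim is -1 (dynamic)
        dims.d[0] = r[0];
        profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, dims);
        dims.d[0] = r[1];
        profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, dims);
        profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, dims);
    }
    config->addOptimizationProfile(profile);
}
nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);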
I cannot share the in-house tool itself, but its implementation adheres to the official TensorRT samples and API usage patterns for multi-profile engine creation. I can, however, provide the engine plan file, so the failure can be reproduced by loading the engine with trtexec:
./trtexec --loadEngine={working_dir}/MyelinFusionCreateExeContextCorrup.trt
which produces the following error:
[I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[I]
[I] TensorRT version: 8.6.1
[I] Loading standard plugins
[I] Engine loaded in 0.002954 sec.
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] Engine deserialized in 0.0363495 sec.
[E] Error[2]: [executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )
[E] Unable to create execution context for stream 0.
[E] Inference set up failed
Miscellaneous
Suspected Myelin Fusion Bug
(more details in the attached SVG file)
Using the inspector, I found that all nodes in the network were fused into a single Myelin layer.
Intentionally breaking that fusion (adding a custom plugin node to disable fusion) eliminated the bug.
Switching to an A40 device did not resolve the issue—the bug persists.