Execution context creation fails with multiple optimization profiles

Description

Hi Community,

I am trying to deploy an ONNX model with dynamic shapes. Concretely, the model’s first dimension (the explicit batch size) is dynamic — for simplicity, I will continue to refer to it as “batch size.” The maximum batch size is 8.

To get the best performance, I created 4 optimization profiles and used them to build a TensorRT engine/plan.

For example, the configuration would be:

Optimization Profile 0: range [1, 2]
Input 1: min dim [1, d1, d2, d3], opt dim [2, d1, d2, d3], max dim [2, d1, d2, d3]
Input 2: min dim [1, d4, d5, d6], opt dim [2, d4, d5, d6], max dim [2, d4, d5, d6]
...

Optimization Profile 1: range [2, 4]
Input 1: min dim [2, d1, d2, d3], opt dim [4, d1, d2, d3], max dim [4, d1, d2, d3]
Input 2: min dim [2, d4, d5, d6], opt dim [4, d4, d5, d6], max dim [4, d4, d5, d6]
...

Optimization Profile 2: range [4, 6]
Input 1: min dim [4, d1, d2, d3], opt dim [6, d1, d2, d3], max dim [6, d1, d2, d3]
Input 2: min dim [4, d4, d5, d6], opt dim [6, d4, d5, d6], max dim [6, d4, d5, d6]
...

Optimization Profile 3: range [6, 8]
Input 1: min dim [6, d1, d2, d3], opt dim [8, d1, d2, d3], max dim [8, d1, d2, d3]
Input 2: min dim [6, d4, d5, d6], opt dim [8, d4, d5, d6], max dim [8, d4, d5, d6]
...

The engine builds successfully, but creating an execution context from the deserialized engine fails.

Note: this issue does not occur when there is only a single optimization profile — in that case everything works perfectly.

Pseudo code

nvinfer1::ILogger logger;
std::string serialized_engine; // Assume the engine has already been built and serialized successfully

nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* cuda_engine = runtime->deserializeCudaEngine(serialized_engine.data(), serialized_engine.size());
// Deserialization succeeds and cuda_engine is non-null

// It fails here: createExecutionContext() returns nullptr and logs the error below
nvinfer1::IExecutionContext* execution_context = cuda_engine->createExecutionContext();

Error msg:

[executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )
[executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )

I have searched the GitHub repository and reviewed the documentation, but found no clues as to why this issue occurs.

Could anyone help me resolve this problem and explain the possible cause?

Any suggestions on working with multiple optimization profiles would also be appreciated.

Environment

TensorRT Version: 8.6.1.6
GPU Type: Local - RTX 3060 6G Laptop; Server - A40
Nvidia Driver Version: R535
CUDA Version: 12.1
CUDNN Version: 8.9.3
Operating System + Version: Ubuntu 20.04, kernel 5.15.0-67-generic
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

MyelinFusionCreateExecContextCorrup.zip (3.3 MB)

including:

MyelinFusionCreateExeContextCorrup.onnx

MyelinFusionCreateExeContextCorrup.trt

layer_information.json.svg

layer information (generated by TREx, the TensorRT Engine Explorer)

Steps To Reproduce

First, I reviewed trtexec (TensorRT 8.6 tools, GitHub) and found that this branch does not support building engines with multiple optimization profiles. Consequently, it is difficult to reproduce the issue with the standard tool alone, from engine building through execution context creation.

To reproduce the workflow, I used an internal (in‑house) tool that follows the same steps as the official TensorRT samples for creating multiple optimization profiles and building engines. The in‑house tool performs the following high‑level steps:

  • Parse the network/model and create a Builder and NetworkDefinition.

  • Create multiple IOptimizationProfile objects and configure input dimension ranges for each profile.

  • Add the profiles to the BuilderConfig and build/serialize the engine.
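The steps above can be sketched with the TensorRT C++ API, following the pattern of the official samples (a minimal sketch, not the in-house tool itself; error handling is omitted, the input names and dimension values are placeholders, and it requires the TensorRT SDK to compile):

```cpp
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
auto network = builder->createNetworkV2(
    1U << static_cast<int>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
// ... parse the ONNX model into `network` with nvonnxparser ...

nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
for (int i = 0; i < 4; ++i) {
    nvinfer1::IOptimizationProfile* profile = builder->createOptimizationProfile();
    // Set kMIN / kOPT / kMAX dims for every input, for this profile's batch range, e.g.:
    // profile->setDimensions("input1", nvinfer1::OptProfileSelector::kMIN,
    //                        nvinfer1::Dims4{min_batch, d1, d2, d3});
    config->addOptimizationProfile(profile);
}

// Build and serialize the engine with all four profiles attached.
nvinfer1::IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
```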

I cannot share the in‑house tool itself, but its implementation adheres to the official TensorRT samples and API usage patterns for multi‑profile engine creation. I can, however, provide the engine plan file so you can load the engine with trtexec.

Instead, you can load the provided engine with trtexec:

./trtexec --loadEngine={working_dir}/MyelinFusionCreateExeContextCorrup.trt

And get the error info:

[I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[I] 
[I] TensorRT version: 8.6.1
[I] Loading standard plugins
[I] Engine loaded in 0.002954 sec.
[I] [TRT] Loaded engine size: 3 MiB
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[I] Engine deserialized in 0.0363495 sec.
[E] Error[2]: [executionContext.cpp::ExecutionContext::436] Error Code 2: Internal Error (Assertion (hostMem - hostMemBase) <= totalSharedPerRunnerHost failed. )
[E] Unable to create execution context for stream 0.
[E] Inference set up failed

Miscellaneous

Suspected Myelin Fusion Bug

(more details on the attached SVG file)

Using the inspector, I found that all nodes in the network were fused into a single Myelin layer.

Intentionally breaking that fusion (adding a custom plugin node to disable fusion) eliminated the bug.

Switching to an A40 device did not resolve the issue—the bug persists.

Hi @wcomaqsw ,

Excellent debugging on this.

Myelin is miscalculating the host memory (totalSharedPerRunnerHost) required for the massively fused layer when it has to juggle multiple optimization profiles. When you call createExecutionContext(), it realizes it needs more memory than it initially budgeted for, triggering the assertion failure. As you noticed, breaking the fusion forces TensorRT to fall back to standard kernels, which calculate their memory correctly and bypass the bug.

What I can recommend for this:

  • TRT 8.6.1 is an older release. Upgrading to a newer one (9.x or 10.x) is the most durable fix, as these internal Myelin memory bugs are frequently patched in newer builds.

  • Your solution of breaking the fusion (e.g., inserting a dummy plugin or Identity layer) is the standard tactical fix if the slight performance drop is acceptable.

  • Alternatively, you can build 4 separate single-profile engines, load all of them into memory, and route each request to the correct engine at runtime based on its batch size.

Please let me know if upgrading solves it or not.

I can contact internal teams if this is a legitimate issue.

Thanks for posting.

@athkumar

Thank you for your response — it confirms my understanding.

Regarding the suggested solutions, here are my thoughts:

  • Solution 1: Our current development target is primarily Orin, and aligning versions on x86 would be additional work, so we cannot upgrade TRT in the short term.

  • Solution 2: This is indeed the most practical tactical fix at the moment, but with our TRT version the effect of breaking the fusion varies by device (e.g., Device A does not reproduce the bug, while Device B still shows the issue under the same network).

  • Solution 3: Building four separate single-profile engines is quite costly in terms of GPU memory. In general, having multiple optimization profiles within a single execution context consumes less memory.

For engineering robustness and to avoid per-device debugging, I plan to handle this model specially by falling back to a single large optimization profile that covers all dynamic ranges, accepting some inference performance loss for greater stability.

Thanks again for your help — I will try upgrading later and report back.

Regards
